Senior Site Reliability Engineer
Remote
Toronto
CA$103,414 - CA$144,836 per year
Senior Level
Top Benefits
Flexible time off: Take time for yourself and priorities.
Stock options: Share in company success.
U-First Fridays: 4 hours/month for learning.
About the role
Who you are
- If you don’t think you meet all of the criteria below but are still interested in the job, please apply. Nobody checks every box. We’re looking for candidates who are particularly strong in a few areas and have some interest and capabilities in others
- This is a hands-on role ideal for engineers who thrive on running production SaaS systems at scale, automating operations, and continuously improving performance, resilience, and deployment pipelines
- BS in Computer Science or equivalent practical experience
- Proven experience managing SaaS or PaaS systems at enterprise scale (multi-region, multi-tenant, secure environments)
- Deep expertise in Kubernetes, including debugging cluster/networking issues and designing for fault tolerance and scalability
- Strong proficiency with Infrastructure as Code tools like Terraform or Terragrunt
- Experience with CI/CD pipelines and GitOps workflows (ArgoCD, Atlantis, Helm)
- Proficiency in one or more programming languages (Go, Python, Bash) for automation and tooling
- Solid understanding of Linux/Unix systems, networking (DNS, TLS/SSL, HTTP), load balancers and distributed systems
- Experience working with API gateway and service mesh technologies
- Familiarity with streaming systems like Kafka and observability platforms (Datadog, Prometheus, Grafana)
- Experience working in a 24/7/365 production support environment
- Hands-on experience with Kong Gateway, Kong Mesh, or similar service connectivity technologies
- Experience operating ClickHouse, Druid, or other time-series and analytics databases
- Experience managing PostgreSQL and Redis in multi-region configurations
- Working knowledge of AWS networking (PrivateLink, Transit Gateway, VPC Peering, Firewalls), Azure VNet, or GCP NCC
- Strong understanding of disaster recovery, resiliency testing, and compliance-driven reliability practices
What the job involves
- As a Site Reliability Engineer, you’ll join the global Platform SRE team responsible for building, operating, and scaling Kong’s multi-region SaaS platform that powers the world’s API connectivity
- You’ll design, automate, and run production systems serving thousands of customers across AWS, GCP, and Azure
- You’ll work on everything from multi-region Kubernetes clusters to service mesh and gateway architectures, ensuring the reliability, scalability, and security of Kong’s SaaS offerings
- Operate and scale Kong’s global SaaS platform (Konnect), ensuring reliability, availability, and performance across regions and clouds
- Build, automate, and maintain Kubernetes-based infrastructure and deployment workflows using Terraform/Terragrunt, Helm, and ArgoCD
- Design, maintain, and optimize multi-region data and caching layers — including PostgreSQL, Redis, ClickHouse, and Druid — for high availability and low latency
- Operate and improve Kong Gateway and Kong Mesh environments supporting hybrid and distributed architectures
- Develop and maintain CI/CD pipelines and GitOps workflows to automate service delivery and ensure consistent infrastructure changes
- Enhance observability and incident response readiness through systems like Datadog, Prometheus, Grafana, and Thanos, defining and tracking SLOs
- Collaborate closely with development and security teams to ensure smooth operation of SaaS services in compliance with reliability, security, and regulatory standards
- Participate in a global 24/7 on-call rotation and drive continuous improvement of operational playbooks and postmortem practices
- Lead and contribute to scaling initiatives that improve elasticity, reliability, and cost-efficiency across the SaaS platform
Benefits
- Flexible time off: Take time to take care of yourself and the things that matter most
- Stock options: We want you to share in our success. That's why stock options are offered to most Kongers
- U-First Fridays: Get 4 hours a month for continuous learning with a book, podcast, or course of your choice
- Virtual events: Stay connected with Donut chats, trivia, fitness challenges, guided meditations, and more
- Dedicated unplug days: Silence those notifications. Enjoy a well-deserved long weekend where the entire team unplugs
- Home office stipend: Build a home office environment tailored to support your productivity
About Kong Inc.
Software Development
501-1000
Powering the API World. No AI without APIs.
Kong enables any company to become an API-first company. Kong’s unified cloud native API platform is easy to use and works in any environment — unleashing developer productivity, automating security, and boosting performance of APIs and microservices at scale.