Jobs.ca

Site Reliability Engineer

Tyk Technologies
14 days ago
Remote
Canada
Mid Level

Top Benefits

Unlimited paid holiday
Flexible working hours
Employee share scheme

About the role

Who you are

  • Strong collaboration skills
  • Launching and operating production-scale Kubernetes clusters
  • Designing and operating infrastructure on AWS and other providers
  • Operating MongoDB (or other document database) clusters
  • Operating Redis (or other key-value storage) clusters
  • Administering Linux servers
  • Maintaining distributed software
  • Operating Prometheus and Grafana
  • Operating logging collection and analysis systems
  • Participating in the on-call rotation (16:00 – 04:00 UTC)
  • Kubernetes & containers (advanced)
  • AWS / EKS (advanced)
  • Linux (advanced)
  • Terraform and IaC in general (proficient)
  • Helm (proficient)
  • Go and/or Python (familiar)
  • MongoDB (or similar)
  • Redis (or similar)
  • Monitoring – Prometheus, Grafana, Thanos (familiar)
  • Grasp of networking concepts (subnets, routing, peering, load balancing, NAT, etc.)
  • Common networking protocols (DNS, TCP/IP, HTTP, TLS, UDP)
  • Proactive, energetic, innovative and change oriented
  • GCP or Azure
  • Bare metal infrastructure engineering
  • API management experience
  • Large scale distributed storage management
  • Familiarity with Rancher
  • CKA/CKAD/CKS
  • Creating and delivering production software in Go

What the job involves

  • We’re looking for a Site Reliability Engineer to manage, maintain, improve and provide support on our platform
  • You will be curious by nature, always looking for ways to improve, as we will look to you for new ideas, solutions and metrics on how we can improve the platform
  • You will also be our first line of incident management for our clients and will help define our response going forward
  • This is a great opportunity to become an integral part of Tyk as we continue on our journey
  • As a remote-first company, we offer you the opportunity to work with an industry-leading distributed team
  • Having access to expertise from across the globe will give you both the support and opportunity to help shape not only Tyk’s Cloud platform but also Tyk as a whole as we continue to grow
  • Maintaining the global Tyk Cloud within the SLAs, SLIs and SLOs you will help to define
  • Identifying reliability issues and working together with your squad to solve them
  • Identifying and introducing new metrics and building relevant dashboards
  • Participating in the on-call rotation
  • Working with your squad to expand multi-region and multi-cloud reach of the platform
  • Documenting operational knowledge
  • Conducting post-incident analysis
  • Automating common tasks
  • Being a key shaper of and contributor to our continuous improvement agenda, whether that means the clarity of our user stories, how we estimate, or how we communicate with other teams and customers; we expect this role to be an advocate of continuous improvement
  • Reliability of our new global Tyk Cloud platform
  • Automation of operations and support
  • Writing and maintaining documentation on SRE processes and policies
  • Recommending and implementing ways of driving operational efficiency and driving down our cost to run, without impacting service
  • Assisting in penetration testing for Cloud through liaising with our provider, providing technical details, and environment setup
  • Incident management

Benefits

  • Everyone has unlimited paid holiday
  • We have total flexibility in hours, as we believe creativity flows better when our people are given the freedom to decide when they are most productive. Everyone is unique, after all
  • Employee share scheme
  • Generous maternity and paternity leave
  • Company retreats
  • Volunteering Days
  • Employee Wellbeing platform

About Tyk Technologies

IT Services and IT Consulting