Site Reliability Engineer

Tyk Technologies 4 months ago

Remote

Canada

Mid Level

Top Benefits

Unlimited paid holiday

Flexible working hours

Employee share scheme

About the role

Who you are

Strong collaboration skills
Launching and operating production-scale Kubernetes clusters
Designing and operating infrastructure on AWS and other providers
Operating MongoDB (or other document database) clusters
Operating Redis (or other key-value storage) clusters
Administering Linux servers
Maintaining distributed software
Operating Prometheus and Grafana
Operating logging collection and analysis systems
Participating in the on-call rotation (16:00 – 04:00 UTC)
Kubernetes & containers (advanced)
AWS / EKS (advanced)
Linux (advanced)
Terraform and IaC in general (proficient)
Helm (proficient)
Go and/or Python (familiar)
MongoDB (or similar)
Redis (or similar)
Monitoring – Prometheus, Grafana, Thanos (familiar)
Grasp of networking concepts (subnets, routing, peering, load balancing, NAT, etc.)
Common networking protocols (DNS, TCP/IP, HTTP, TLS, UDP)
Proactive, energetic, innovative and change oriented
GCP or Azure
Bare metal infrastructure engineering
API management experience
Large scale distributed storage management
Familiarity with Rancher
CKA/CKAD/CKS
Creating and delivering production software in Go

What the job involves

We’re looking for a Site Reliability Engineer to manage, maintain, improve and provide support on our platform
You will be curious by nature, always looking for ways to improve, as we will look to you for new ideas, solutions and metrics on how we can improve the platform
You will also be our first line of incident management for our clients and will help define our response going forward
This is a great opportunity to become an integral part of Tyk as we continue on our journey
As a remote-first company, you will have the opportunity to work with an industry-leading distributed team
Having access to expertise from across the globe will give you both the support and opportunity to help shape not only Tyk’s Cloud platform but also Tyk as a whole as we continue to grow
Maintaining global Tyk Cloud within SL(A/I/O)s you will help to define
Identifying reliability issues and working together with your squad to solve them
Identifying and introducing new metrics and building relevant dashboards
Participating in the on-call rotation
Working with your squad to expand multi-region and multi-cloud reach of the platform
Documenting operational knowledge
Conducting post-incident analysis
Automating common tasks
Be a key shaper of and contributor to our continuous improvement agenda – be it the clarity of our user stories, how we estimate, or how we communicate with other teams or customers – we expect this role to be an advocate of continuous improvement
Reliability of our new global Tyk Cloud platform
Automation of operations and support
Writing and maintaining documentation on SRE processes and policies
Recommending and implementing ways of driving operational efficiency and driving down our cost to run, without impacting service
Assisting in penetration testing for Cloud through liaising with our provider, providing technical details, and environment setup
Incident management

Benefits

Everyone has unlimited paid holiday
We have total flexibility in hours, as we believe creativity flows better when our people are given freedom to decide when they are most productive. Everyone is unique, after all
Employee share scheme
Generous maternity and paternity leave
Company retreats
Volunteering Days
Employee Wellbeing platform

About Tyk Technologies

IT Services and IT Consulting

