About the role
Who you are
- Bachelor’s or Master’s degree in Computer Science or a related field
- At least 8 years of experience in Site Reliability Engineering or a related field
- Hands-on experience with Go and/or Python
- Strong knowledge of cloud computing platforms such as AWS, Azure, or Google Cloud Platform
- Excellent understanding of distributed databases and SQL; experience with ClickHouse in particular is a major plus
- Hands-on experience with container orchestration tools such as Kubernetes or Docker Swarm
- Strong experience with automation and configuration management tools such as Ansible, Terraform, or Puppet
- You are a strong problem solver and have solid production debugging skills
- You are passionate about efficiency, availability, scalability, and data governance
- You thrive in a fast-paced environment and see yourself as a partner to the business, with the shared goal of moving it forward
- You have a high level of responsibility, ownership, and accountability
- Excellent communication and interpersonal skills
What the job involves
- We are committed to providing our customers with reliable and secure services, so we are expanding our central Site Reliability Engineering team
- You will be responsible for building and leading processes to ensure the reliability, availability, scalability, and performance of our cloud infrastructure
- You will collaborate with teams such as Control Plane, Data Plane, Core, Security, Support, and Operations, and guide them in designing and implementing scalable, secure, highly available, and fault-tolerant distributed systems
- You will also own incident management and response, post-mortem analysis (including running blameless post-mortems), and continuous improvement of our Cloud services
- You will use your software engineering expertise to develop platforms and tools that improve the operational and engineering efficiency of ClickHouse Cloud
- This role is a unique opportunity to make a significant impact on our elastic, limitless-scale, high-performance ClickHouse Cloud
- Collaborate with various engineering teams in ClickHouse to design and implement scalable, secure, and highly available systems for ClickHouse
- Establish and manage service level objectives (SLOs) and service level agreements (SLAs) for ClickHouse Cloud
- Ensure all infrastructure components in ClickHouse Cloud (including Data Plane, Control Plane, ClickHouse Core, etc.) have monitoring and alerting in place for timely detection and resolution of incidents
- Enhance and refine incident response processes and post-mortem analysis for any outages in ClickHouse Cloud, including working with the support team to communicate with impacted customers
- Continuously improve the reliability and performance of our ClickHouse services
- Plan, enable, and drive Chaos initiatives across Engineering teams, based upon internal priorities
- Manage on-call processes to respond to performance and reliability issues, and establish best practices for coordinating escalation to resolve issues and minimize downtime
Benefits
- Flexible work environment - ClickHouse is a globally distributed company and remote-friendly. We currently operate in 20 countries
- Healthcare - Employer contributions towards your healthcare
- Equity in the company - Every new team member who joins our company receives stock options
- Time off - Flexible time off in the US, generous entitlement in other countries
- Home office setup - $500 if you’re a remote employee
- Global Gatherings - We believe in the power of in-person connection and offer opportunities to engage with colleagues at company-wide offsites
About ClickHouse
ClickHouse is an open-source, column-oriented OLAP database management system that allows users to generate analytical reports using SQL queries in real time. Its technology works 100-1000x faster than traditional database management systems and processes hundreds of millions to over a billion rows, and tens of gigabytes of data, per server per second. With a widespread user base around the globe, the technology has received praise for its reliability, ease of use, and fault tolerance. Learn more at clickhouse.com.
Bay Area, USA | Amsterdam, The Netherlands