Senior Site Reliability Engineer

ClickHouse 30 days ago

Remote

Canada

Senior Level

Top Benefits

Healthcare coverage with employer contributions

Equity: stock options for new hires

Flexible remote work across 20 countries

About the role

Who you are

Bachelor’s or Master’s degree in Computer Science or a related field
At least 8 years of experience in Site Reliability Engineering or a related field
Hands-on experience with Go and/or Python
Strong knowledge of cloud computing platforms such as AWS, Azure, or Google Cloud Platform
Excellent understanding of distributed databases and SQL; experience with ClickHouse in particular is a major plus
Hands-on experience with container orchestration tools such as Kubernetes or Docker Swarm
Strong experience with automation and configuration management tools such as Ansible, Terraform, or Puppet
You are a strong problem solver and have solid production debugging skills
You are passionate about efficiency, availability, scalability, and data governance
You thrive in a fast-paced environment and see yourself as a partner to the business, with the shared goal of moving it forward
You have a high level of responsibility, ownership, and accountability
Excellent communication and interpersonal skills

What the job involves

We are committed to providing our customers with reliable and secure services, so we are expanding our central Site Reliability Engineering team
You will be responsible for building and leading processes to ensure the reliability, availability, scalability, and performance of our cloud infrastructure
You will collaborate with teams such as Control Plane, Data Plane, Core, Security, Support, and Operations, and guide them in designing and implementing scalable, secure, highly available, and fault-tolerant distributed systems
You will also own the areas of incident management and response, post-mortem analysis including running blameless postmortems, and continuous improvement of our Cloud services
You will use your software engineering expertise to build platforms and tools that improve the operational and engineering efficiency of ClickHouse Cloud
This role is a unique opportunity to make a significant impact on ClickHouse Cloud, our elastic, high-performance service built for limitless scale
Collaborate with various engineering teams in ClickHouse to design and implement scalable, secure, and highly available systems for ClickHouse
Establish and manage service level objectives (SLOs) and service level agreements (SLAs) for ClickHouse Cloud
Ensure all infrastructure components in ClickHouse Cloud (including Data Plane, Control Plane, ClickHouse Core, etc.) have monitoring and alerting in place for timely detection and resolution of incidents
Enhance and refine incident response processes and post-mortem analysis for any outages in ClickHouse Cloud, including working with the support team to communicate with impacted customers
Continuously improve the reliability and performance of our ClickHouse services
Plan, enable, and drive Chaos initiatives across Engineering teams, based upon internal priorities
Manage on-call processes to respond to performance and reliability issues, and establish best practices for coordinating escalation to resolve issues and minimize downtime

Benefits

Flexible work environment - ClickHouse is a globally distributed company and remote-friendly. We currently operate in 20 countries
Healthcare - Employer contributions towards your healthcare
Equity in the company - Every new team member who joins our company receives stock options
Time off - Flexible time off in the US, generous entitlement in other countries
A $500 home office setup if you’re a remote employee
Global Gatherings - We believe in the power of in-person connection and offer opportunities to engage with colleagues at company-wide offsites


About ClickHouse

Software Development

51-200

ClickHouse is an open-source, column-oriented OLAP database management system that allows users to generate analytical reports using SQL queries in real time. Its technology works 100-1000x faster than traditional database management systems, and processes hundreds of millions to over a billion rows and tens of gigabytes of data per server per second. With a widespread user base around the globe, the technology has received praise for its reliability, ease of use, and fault tolerance. Learn more at clickhouse.com.

Bay Area, USA | Amsterdam, The Netherlands

