Jobs.ca

Senior Site Reliability Engineer

ClickHouse
7 days ago
Remote
Canada
Senior Level

Top Benefits

Healthcare coverage with employer contributions
Equity: stock options for new hires
Flexible remote work across 20 countries

About the role

Who you are

  • Bachelor’s or Master’s degree in Computer Science or a related field
  • At least 8 years of experience in Site Reliability Engineering or a related field
  • Hands-on experience with Go and/or Python
  • Strong knowledge of cloud computing platforms such as AWS, Azure, or Google Cloud Platform
  • Excellent understanding of distributed databases and SQL; experience with ClickHouse in particular is a major plus
  • Hands-on experience with container orchestration tools such as Kubernetes or Docker Swarm
  • Strong experience with automation and configuration management tools such as Ansible, Terraform, or Puppet
  • You are a strong problem solver and have solid production debugging skills
  • You are passionate about efficiency, availability, scalability, and data governance
  • You thrive in a fast-paced environment and see yourself as a partner to the business, sharing the goal of moving it forward
  • You have a high level of responsibility, ownership, and accountability
  • Excellent communication and interpersonal skills

What the job involves

  • We are committed to providing our customers with reliable and secure services, so we are expanding our central Site Reliability Engineering team
  • You will be responsible for building and leading processes to ensure the reliability, availability, scalability, and performance of our cloud infrastructure
  • You will collaborate with teams such as Control Plane, Data Plane, Core, Security, Support, and Operations, and guide them in designing and implementing scalable, secure, highly available, and fault-tolerant distributed systems
  • You will also own incident management and response, post-mortem analysis (including running blameless post-mortems), and continuous improvement of our Cloud services
  • You will be leveraging your software engineering expertise to develop software platforms and tools to optimize the operational and engineering efficiencies of ClickHouse Cloud
  • This role is a unique opportunity to make a significant impact on our elastic, limitless-scale, high-performance ClickHouse Cloud
  • Collaborate with various engineering teams in ClickHouse to design and implement scalable, secure, and highly available systems for ClickHouse
  • Establish and manage service level objectives (SLOs) and service level agreements (SLAs) for ClickHouse Cloud
  • Ensure all infrastructure components in ClickHouse Cloud (including Data Plane, Control Plane, ClickHouse Core, etc.) have monitoring and alerting in place so that incidents are detected and resolved promptly
  • Enhance and refine incident response processes and post-mortem analysis for any outages in ClickHouse Cloud, including working with the support team to communicate with impacted customers
  • Continuously improve the reliability and performance of our ClickHouse services
  • Plan, enable, and drive Chaos initiatives across Engineering teams, based upon internal priorities
  • Manage on-call processes to respond to performance and reliability issues, and establish best practices for coordinating escalation to resolve issues and minimize downtime

Benefits

  • Flexible work environment - ClickHouse is a globally distributed company and remote-friendly. We currently operate in 20 countries
  • Healthcare - Employer contributions towards your healthcare
  • Equity in the company - Every new team member who joins our company receives stock options
  • Time off - Flexible time off in the US, generous entitlement in other countries
  • A $500 home office setup if you’re a remote employee
  • Global Gatherings - We believe in the power of in-person connection and offer opportunities to engage with colleagues at company-wide offsites

About ClickHouse

Software Development
51-200

ClickHouse is an open-source, column-oriented OLAP database management system that allows users to generate analytical reports using SQL queries in real time. Its technology works 100-1000x faster than traditional database management systems and processes hundreds of millions to over a billion rows and tens of gigabytes of data per server per second. With a widespread user base around the globe, the technology has received praise for its reliability, ease of use, and fault tolerance. Learn more at clickhouse.com.

Bay Area, USA | Amsterdam, The Netherlands
