Site Reliability Engineer - Kubernetes Platform

Pantera Capital • Palo Alto, CA, United States

About xAI

xAI’s mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. Our team is small, highly motivated, and focused on engineering excellence. This organization is for individuals who appreciate challenging themselves and thrive on curiosity. We operate with a flat organizational structure. All employees are expected to be hands-on and to contribute directly to the company’s mission. Leadership is given to those who show initiative and consistently deliver excellence. Work ethic and strong prioritization skills are important. All engineers are expected to have strong communication skills. They should be able to concisely and accurately share knowledge with their teammates.

About the Role

We are seeking a highly skilled Senior Site Reliability Storage Engineer to join our mission-driven team, focusing on designing, building, and optimizing Kubernetes clusters across multiple regions. In this role, you will leverage your expertise in Kubernetes orchestration and distributed systems to enhance the reliability, performance, and cost-effectiveness of xAI’s infrastructure. You will collaborate closely with engineering teams to deliver robust, scalable solutions that support large-scale AI workloads. The ideal candidate is passionate about automation, observability, and ensuring the integrity of critical systems in a fast-paced, innovative environment.

Responsibilities

Develop and optimize software to provision and manage Kubernetes clusters on-premises, enabling xAI to scale efficiently.
Enhance the reliability, performance, and cost-effectiveness of Kubernetes infrastructure to support large-scale AI and application workloads.
Collaborate with xAI engineers to understand workload requirements and design tailored Kubernetes solutions to meet their needs.
Implement robust observability, monitoring, and security practices to ensure the integrity, availability, and confidentiality of critical systems.
Manage storage infrastructure using Infrastructure-as-Code (IaC) tools such as Pulumi, Terraform, or Ansible.
Drive system reliability through incident management, postmortems, and the definition of clear SLAs and SLOs.
Contribute to the Kubernetes stack, including expertise in CNI, CRI, CSI, and related components.
This is an in-person role based in Palo Alto, CA, with up to 25% travel required.

Required Qualifications

5+ years of experience as a Site Reliability Engineer or similar role, with a focus on building and maintaining reliable, scalable systems.

Proven expertise in managing Kubernetes infrastructure using tools like Cluster API (CAPI) and kubeadm.

Proficiency in managing storage infrastructure with IaC tools such as Pulumi, Terraform, or Ansible.

Deep understanding of the Kubernetes stack, including CNI, CRI, CSI, and related components.

Demonstrated ability to improve system reliability through incident management, postmortems, and defining SLAs / SLOs.

Preferred Qualifications

Experience with high-traffic web or mobile application workloads, including optimizing Kubernetes for large-scale deployments.

Familiarity with chaos engineering, capacity planning, or similar practices for ensuring system resilience.

Proficiency with tools such as Kyverno, ArgoCD, or Go programming for infrastructure automation.

Strong sense of ownership, curiosity, and enthusiasm for tackling complex technical challenges.

Passion for problem-solving and a proactive drive to deliver impactful results.

A sense of adventure and humor to navigate challenges with a positive mindset.

Annual Salary Range

$180,000 - $440,000 USD

Benefits

Base salary is just one part of our total rewards package at xAI, which also includes equity, comprehensive medical, vision, and dental coverage, access to a 401(k) retirement plan, short & long-term disability insurance, life insurance, and various other discounts and perks.

xAI is an equal opportunity employer.
