Talent.com
Site Reliability Engineer - Kubernetes Platform
xAI • Palo Alto, CA, US
Job type: Full-time

Job Description

About xAI

xAI's mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. Our team is small, highly motivated, and focused on engineering excellence. This organization is for individuals who appreciate challenging themselves and thrive on curiosity. We operate with a flat organizational structure. All employees are expected to be hands-on and to contribute directly to the company's mission. Leadership is given to those who show initiative and consistently deliver excellence. Work ethic and strong prioritization skills are important. All engineers are expected to have strong communication skills. They should be able to concisely and accurately share knowledge with their teammates.

About the Role

We are seeking a highly skilled Site Reliability Engineer to join our mission-driven team, focusing on designing, building, and optimizing Kubernetes clusters across multiple regions. In this role, you will leverage your expertise in Kubernetes orchestration and distributed systems to enhance the reliability, performance, and cost-effectiveness of xAI's infrastructure. You will collaborate closely with engineering teams to deliver robust, scalable solutions that support large-scale AI workloads. The ideal candidate is passionate about automation, observability, and ensuring the integrity of critical systems in a fast-paced, innovative environment.

Responsibilities

  • Develop and optimize software to provision and manage Kubernetes clusters on-premises, enabling xAI to scale efficiently.
  • Enhance the reliability, performance, and cost-effectiveness of Kubernetes infrastructure to support large-scale AI and application workloads.
  • Collaborate with xAI engineers to understand workload requirements and design tailored Kubernetes solutions to meet their needs.
  • Implement robust observability, monitoring, and security practices to ensure the integrity, availability, and confidentiality of critical systems.
  • Manage storage infrastructure using Infrastructure-as-Code (IaC) tools such as Pulumi, Terraform, or Ansible.
  • Drive system reliability through incident management, postmortems, and the definition of clear SLAs and SLOs.
  • Contribute to the Kubernetes stack, including expertise in CNI, CRI, CSI, and related components.
  • This is an in-person role based in Palo Alto, CA, with up to 25% travel required.

Required Qualifications

  • 5+ years of experience as a Site Reliability Engineer or similar role, with a focus on building and maintaining reliable, scalable systems.
  • Proven expertise in managing Kubernetes infrastructure using tools like Cluster API (CAPI) and kubeadm.
  • Proficiency in managing storage infrastructure with IaC tools such as Pulumi, Terraform, or Ansible.
  • Deep understanding of the Kubernetes stack, including CNI, CRI, CSI, and related components.
  • Demonstrated ability to improve system reliability through incident management, postmortems, and defining SLAs and SLOs.

Preferred Qualifications

  • Experience with high-traffic web or mobile application workloads, including optimizing Kubernetes for large-scale deployments.
  • Familiarity with chaos engineering, capacity planning, or similar practices for ensuring system resilience.
  • Proficiency with tools such as Kyverno, ArgoCD, or Go programming for infrastructure automation.
  • Strong sense of ownership, curiosity, and enthusiasm for tackling complex technical challenges.
  • Passion for problem-solving and a proactive drive to deliver impactful results.
  • A sense of adventure and humor to navigate challenges with a positive mindset.

Annual Salary Range

$180,000 - $440,000 USD

Benefits

Base salary is just one part of our total rewards package at xAI, which also includes equity, comprehensive medical, vision, and dental coverage, access to a 401(k) retirement plan, short- and long-term disability insurance, life insurance, and various other discounts and perks.

xAI is an equal opportunity employer.
