Talent.com
Site Reliability Engineer - Storage

xAI • Palo Alto, CA, US
Posted 30 days ago
Job type
  • Full-time

Job Description

About xAI

xAI's mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. Our team is small, highly motivated, and focused on engineering excellence. This organization is for individuals who appreciate challenging themselves and thrive on curiosity. We operate with a flat organizational structure. All employees are expected to be hands-on and to contribute directly to the company's mission. Leadership is given to those who show initiative and consistently deliver excellence. Work ethic and strong prioritization skills are important. All engineers are expected to have strong communication skills. They should be able to concisely and accurately share knowledge with their teammates.

About the role

As a Site Reliability Storage Engineer, you will play a pivotal role in designing, building, and operating exascale storage systems that manage our cutting-edge AI research data with scalability and reliability across multiple regions. The core responsibility of this role is to ensure that our heterogeneous storage systems, both on-prem and in the cloud, are reliable and performant.

We're seeking engineers with expertise in exascale data management systems or distributed filesystems to join our mission-driven team.

What you'll do

  • Develop and optimize software to manage exascale data, enabling efficient and reliable access for xAI researchers working on advanced AI models.
  • Enhance the reliability, performance, and cost-effectiveness of xAI's storage infrastructure to support large-scale AI research workloads.
  • Collaborate closely with researchers to understand their data use cases and tailor storage solutions to meet their needs.
  • Implement robust security measures to safeguard critical datasets, ensuring data integrity and confidentiality.

Ideal Experience

You'd be an exceptional candidate if you possess some (or all) of the following:

  • Writing scalable, high-performance code in Rust or Go for storage-related applications or tooling.
  • Managing storage infrastructure with IaC tools like Pulumi, Terraform, or Ansible.
  • Experience working with storage vendors, facilitating partnership alignment, and integrating their tooling into xAI's infrastructure.
  • Familiarity with Kubernetes storage primitives (e.g., Persistent Volumes, CSI drivers) and integrating storage with containerized workloads.
  • Bonus: Experience with AI/ML data pipelines, including handling large datasets for training and inference.

Tech Stack

  • Kubernetes
  • Pulumi
  • Rust and Go

Interview Process

After submitting your application, the team reviews your CV and statement of exceptional work. If your application passes this stage, you will be invited to a 45-minute interview ("phone interview") during which a member of our team will ask some basic questions. If you clear the initial phone interview, you will enter the main process, which consists of four technical interviews:

  • Coding assessment in Python, Golang, or Rust.
  • Systems hands-on: Demonstrate practical skills in a live problem-solving session.
  • Coding assessment or system design discussion based on the candidate's background.
  • Project deep-dive: Present your past exceptional work to a small audience.

Every application is reviewed by a member of our technical team. All interviews will be conducted via Google Meet.

We do not condone usage of AI in interviews and have tools to detect AI usage.

Annual Salary Range

$180,000 - $440,000 USD

Benefits

Base salary is just one part of our total rewards package at xAI, which also includes equity, comprehensive medical, vision, and dental coverage, access to a 401(k) retirement plan, short & long-term disability insurance, life insurance, and various other discounts and perks.

xAI is an equal opportunity employer.
