Slurm Administration & Systems Architecture

Midjourney • Hayward, CA, US
Posted 30 days ago
Job type: Full-time

Overview

We are seeking a highly skilled HPC / AI / ML Cluster Engineer to support the design, deployment, and ongoing operations of large-scale HPC environments powered by Slurm. This role centers on cluster engineering, administration, and performance optimization, with emphasis on GPU-accelerated computing, advanced networking, and workload scheduling. In this role, you will work closely with our researchers, vendors, and partners to manage Slurm clusters that are used for AI / ML workloads.

Responsibilities

Cluster Engineering & Deployment

  • Participate in the design and bring-up of bare metal HPC / AI / ML environments.
  • Architect compute node definitions (NUMA, GRES GPU topologies, CPU pinning) and Slurm partitioning strategies for diverse workloads.
  • Integrate heterogeneous hardware platforms into cohesive scheduling environments.
  • Develop provisioning and imaging workflows (Ansible, MAAS, cloud-init, CI / CD pipelines) for reproducible cluster build-out.
  • Coordinate communications between vendors, researchers, and other partners during cluster bring-up and operation.

Slurm Management

  • Configure and operate the Slurm Workload Manager.
  • Build custom Slurm plugins and scripts (epilog / prolog, pam_slurm_adopt) to extend functionality and integrate with authentication and monitoring.
  • Manage federated Slurm setups across multi-site or hybrid cloud environments.

System Administration & Monitoring

  • Administer Linux HPC environments, including network configuration, storage integration, and kernel tuning for HPC workloads.
  • Deploy and maintain observability stacks for system health, GPU metrics, and job monitoring.
  • Automate failure detection, node health checks, and job cleanup to ensure high uptime and reliability.
  • Manage security and access control (LDAP / SSSD, VPN, PAM, SSH session auditing).

User & Stakeholder Support

  • Assist cluster users with developing workflows that make efficient use of compute resources.
  • Containerize HPC applications with Docker / Podman / Enroot-Pyxis and integrate GPU-aware runtimes into Slurm jobs.
  • Automate cost accounting and cluster usage reporting.

Qualifications

  • 7+ years of experience in HPC cluster administration and engineering, with deep knowledge of Slurm.
  • Familiarity with common AI / ML software package dependencies and workflows.
  • Expert in Slurm configuration, partition design, QoS / preemption policies, and GRES GPU scheduling.
  • Strong background in Linux system administration, networking, and performance tuning for HPC environments.
  • Hands-on experience with parallel file systems, advanced networking (InfiniBand, RoCE, 100 / 200 GbE), and monitoring stacks.
  • Proficient with automation tools (Ansible, Terraform, CI / CD pipelines) and version control.
  • Demonstrated ability to operate GPU-accelerated clusters at scale.