Talent.com
Slurm Administration & Systems Architecture (Hayward)
Midjourney • Hayward, CA, US
Posted 30 days ago
Job type: Part-time
Job description

Overview

We are seeking a highly skilled HPC/AI/ML Cluster Engineer to support the design, deployment, and ongoing operation of large-scale HPC environments powered by Slurm. The role centers on cluster engineering, administration, and performance optimization, with emphasis on GPU-accelerated computing, advanced networking, and workload scheduling. You will work closely with our researchers, vendors, and partners to manage Slurm clusters used for AI/ML workloads.

Responsibilities

Cluster Engineering & Deployment

  • Participate in the design and bring-up of bare-metal HPC/AI/ML environments.
  • Architect compute node definitions (NUMA, GRES GPU topologies, CPU pinning) and Slurm partitioning strategies for diverse workloads.
  • Integrate heterogeneous hardware platforms into cohesive scheduling environments.
  • Develop provisioning and imaging workflows (Ansible, MAAS, cloud-init, CI/CD pipelines) for reproducible cluster build-out.
  • Coordinate communications between vendors, researchers, and other partners during cluster bring-up and operation.

Slurm Management

  • Configure and operate the Slurm Workload Manager.
  • Build custom Slurm plugins and scripts (epilog/prolog, pam_slurm_adopt) to extend functionality and integrate with authentication and monitoring.
  • Manage federated Slurm setups across multi-site or hybrid cloud environments.

System Administration & Monitoring

  • Administer Linux HPC environments, including network configuration, storage integration, and kernel tuning for HPC workloads.
  • Deploy and maintain observability stacks for system health, GPU metrics, and job monitoring.
  • Automate failure detection, node health checks, and job cleanup to ensure high uptime and reliability.
  • Manage security and access control (LDAP/SSSD, VPN, PAM, SSH session auditing).

User & Stakeholder Support

  • Assist cluster users with developing workflows that make efficient use of compute resources.
  • Containerize HPC applications with Docker/Podman/Enroot-Pyxis and integrate GPU-aware runtimes into Slurm jobs.
  • Automate cost accounting and cluster usage reporting.

Qualifications

  • 7+ years of experience in HPC cluster administration and engineering, with deep knowledge of Slurm.
  • Familiarity with common AI/ML software package dependencies and workflows.
  • Expert in Slurm configuration, partition design, QoS/preemption policies, and GRES GPU scheduling.
  • Strong background in Linux system administration, networking, and performance tuning for HPC environments.
  • Hands-on experience with parallel file systems, advanced networking (InfiniBand, RoCE, 100/200 GbE), and monitoring stacks.
  • Proficient with automation tools (Ansible, Terraform, CI/CD pipelines) and version control.
  • Demonstrated ability to operate GPU-accelerated clusters at scale.