Talent.com
Slurm Administration & Systems Architecture (Hayward)
Midjourney • Hayward, CA, US
Job type: Part-time

Overview

We are seeking a highly skilled HPC / AI / ML Cluster Engineer to support the design, deployment, and ongoing operations of large-scale HPC environments powered by Slurm. This role centers on cluster engineering, administration, and performance optimization, with emphasis on GPU-accelerated computing, advanced networking, and workload scheduling. In this role, you will work closely with our researchers, vendors, and partners to manage Slurm clusters that are used for AI / ML workloads.

Responsibilities

Cluster Engineering & Deployment

  • Participate in the design and bring-up of bare metal HPC / AI / ML environments.
  • Architect compute node definitions (NUMA, GRES GPU topologies, CPU pinning) and Slurm partitioning strategies for diverse workloads.
  • Integrate heterogeneous hardware platforms into cohesive scheduling environments.
  • Develop provisioning and imaging workflows (Ansible, MAAS, cloud-init, CI / CD pipelines) for reproducible cluster build-out.
  • Coordinate communications between vendors, researchers, and other partners during cluster bring-up and operation.

Slurm Management

  • Configure and operate the Slurm Workload Manager.
  • Build custom Slurm plugins and scripts (epilog / prolog, pam_slurm_adopt) to extend functionality and integrate with authentication and monitoring.
  • Manage federated Slurm setups across multi-site or hybrid cloud environments.

System Administration & Monitoring

  • Administer Linux HPC environments, including network configuration, storage integration, and kernel tuning for HPC workloads.
  • Deploy and maintain observability stacks for system health, GPU metrics, and job monitoring.
  • Automate failure detection, node health checks, and job cleanup to ensure high uptime and reliability.
  • Manage security and access control (LDAP / SSSD, VPN, PAM, SSH session auditing).

User & Stakeholder Support

  • Assist cluster users with developing workflows that make efficient use of compute resources.
  • Containerize HPC applications with Docker / Podman / Enroot-Pyxis and integrate GPU-aware runtimes into Slurm jobs.
  • Automate cost accounting and cluster usage reporting.

Qualifications

  • 7+ years of experience in HPC cluster administration and engineering, with deep knowledge of Slurm.
  • Familiarity with common AI / ML software package dependencies and workflows.
  • Expert in Slurm configuration, partition design, QoS / preemption policies, and GRES GPU scheduling.
  • Strong background in Linux system administration, networking, and performance tuning for HPC environments.
  • Hands-on experience with parallel file systems, advanced networking (InfiniBand, RoCE, 100 / 200 GbE), and monitoring stacks.
  • Proficient with automation tools (Ansible, Terraform, CI / CD pipelines) and version control.
  • Demonstrated ability to operate GPU-accelerated clusters at scale.