Talent.com
Slurm Administration & Systems Architecture

Midjourney • Alameda, CA, US
Posted 30 days ago
Job type
  • Full-time
Job description

Overview

We are seeking a highly skilled HPC / AI / ML Cluster Engineer to support the design, deployment, and ongoing operations of large-scale HPC environments powered by Slurm. This role centers on cluster engineering, administration, and performance optimization, with emphasis on GPU-accelerated computing, advanced networking, and workload scheduling. You will work closely with our researchers, vendors, and partners to manage Slurm clusters used for AI / ML workloads.

Responsibilities

Cluster Engineering & Deployment

  • Participate in the design and bring-up of bare metal HPC / AI / ML environments.
  • Architect compute node definitions (NUMA, GRES GPU topologies, CPU pinning) and Slurm partitioning strategies for diverse workloads.
  • Integrate heterogeneous hardware platforms into cohesive scheduling environments.
  • Develop provisioning and imaging workflows (Ansible, MAAS, cloud-init, CI / CD pipelines) for reproducible cluster build-out.
  • Coordinate communications between vendors, researchers, and other partners during cluster bring-up and operation.

Slurm Management

  • Configure and operate the Slurm Workload Manager.
  • Build custom Slurm plugins and scripts (epilog / prolog, pam_slurm_adopt) to extend functionality and integrate with authentication and monitoring.
  • Manage federated Slurm setups across multi-site or hybrid cloud environments.

System Administration & Monitoring

  • Administer Linux HPC environments, including network configuration, storage integration, and kernel tuning for HPC workloads.
  • Deploy and maintain observability stacks for system health, GPU metrics, and job monitoring.
  • Automate failure detection, node health checks, and job cleanup to ensure high uptime and reliability.
  • Manage security and access control (LDAP / SSSD, VPN, PAM, SSH session auditing).

User & Stakeholder Support

  • Assist cluster users with developing workflows that make efficient use of compute resources.
  • Containerize HPC applications with Docker / Podman / Enroot-Pyxis and integrate GPU-aware runtimes into Slurm jobs.
  • Automate cost accounting and cluster usage reporting.

Qualifications

  • 7+ years of experience in HPC cluster administration and engineering, with deep knowledge of Slurm.
  • Familiarity with common AI / ML software package dependencies and workflows.
  • Expert in Slurm configuration, partition design, QoS / preemption policies, and GRES GPU scheduling.
  • Strong background in Linux system administration, networking, and performance tuning for HPC environments.
  • Hands-on experience with parallel file systems, advanced networking (InfiniBand, RoCE, 100 / 200 GbE), and monitoring stacks.
  • Proficient with automation tools (Ansible, Terraform, CI / CD pipelines) and version control.
  • Demonstrated ability to operate GPU-accelerated clusters at scale.