Talent.com
Slurm Administration & Systems Architecture (Sonoma)

Midjourney • Sonoma, CA, United States
Job type: Full-time

Overview

We are seeking a highly skilled HPC / AI / ML Cluster Engineer to support the design, deployment, and ongoing operations of large-scale HPC environments powered by Slurm. This role centers on cluster engineering, administration, and performance optimization, with emphasis on GPU-accelerated computing, advanced networking, and workload scheduling. In this role, you will work closely with our researchers, vendors, and partners to manage Slurm clusters that are used for AI / ML workloads.

Responsibilities

Cluster Engineering & Deployment

  • Participate in the design and bring-up of bare-metal HPC / AI / ML environments.
  • Architect compute node definitions (NUMA, GRES GPU topologies, CPU pinning) and Slurm partitioning strategies for diverse workloads.
  • Integrate heterogeneous hardware platforms into cohesive scheduling environments.
  • Develop provisioning and imaging workflows (Ansible, MAAS, cloud-init, CI / CD pipelines) for reproducible cluster build-out.
  • Coordinate communications between vendors, researchers, and other partners during cluster bring-up and operation.

Slurm Management

  • Configure and operate the Slurm Workload Manager.
  • Build custom Slurm plugins and scripts (prolog / epilog, pam_slurm_adopt) to extend functionality and integrate with authentication and monitoring.
  • Manage federated Slurm setups across multi-site or hybrid cloud environments.

System Administration & Monitoring

  • Administer Linux HPC environments, including network configuration, storage integration, and kernel tuning for HPC workloads.
  • Deploy and maintain observability stacks for system health, GPU metrics, and job monitoring.
  • Automate failure detection, node health checks, and job cleanup to ensure high uptime and reliability.
  • Manage security and access control (LDAP / SSSD, VPN, PAM, SSH session auditing).

User & Stakeholder Support

  • Assist cluster users with developing workflows that make efficient use of compute resources.
  • Containerize HPC applications with Docker / Podman / Enroot-Pyxis and integrate GPU-aware runtimes into Slurm jobs.
  • Automate cost accounting and cluster usage reporting.

Qualifications

  • 7+ years of experience in HPC cluster administration and engineering, with deep knowledge of Slurm.
  • Familiarity with common AI / ML software package dependencies and workflows.
  • Expert in Slurm configuration, partition design, QoS / preemption policies, and GRES GPU scheduling.
  • Strong background in Linux system administration, networking, and performance tuning for HPC environments.
  • Hands-on experience with parallel file systems, advanced networking (InfiniBand, RoCE, 100 / 200 GbE), and monitoring stacks.
  • Proficient with automation tools (Ansible, Terraform, CI / CD pipelines) and version control.
  • Demonstrated ability to operate GPU-accelerated clusters at scale.