Talent.com
Slurm Administration & Systems Architecture (Sonoma)

Midjourney • Sonoma, CA, United States
Job type: Full-time

Overview

We are seeking a highly skilled HPC / AI / ML Cluster Engineer to support the design, deployment, and ongoing operations of large-scale HPC environments powered by Slurm. This role centers on cluster engineering, administration, and performance optimization, with emphasis on GPU-accelerated computing, advanced networking, and workload scheduling. In this role, you will work closely with our researchers, vendors, and partners to manage Slurm clusters that are used for AI / ML workloads.

Responsibilities

Cluster Engineering & Deployment

  • Participate in the design and bring-up of bare metal HPC / AI / ML environments.
  • Architect compute node definitions (NUMA, GRES GPU topologies, CPU pinning) and Slurm partitioning strategies for diverse workloads.
  • Integrate heterogeneous hardware platforms into cohesive scheduling environments.
  • Develop provisioning and imaging workflows (Ansible, MAAS, cloud-init, CI / CD pipelines) for reproducible cluster build-out.
  • Coordinate communications between vendors, researchers, and other partners during cluster bring-up and operation.

Slurm Management

  • Configure and operate the Slurm Workload Manager.
  • Build custom Slurm plugins and scripts (epilog / prolog, pam_slurm_adopt) to extend functionality and integrate with authentication and monitoring.
  • Manage federated Slurm setups across multi-site or hybrid cloud environments.

System Administration & Monitoring

  • Administer Linux HPC environments, including network configuration, storage integration, and kernel tuning for HPC workloads.
  • Deploy and maintain observability stacks for system health, GPU metrics, and job monitoring.
  • Automate failure detection, node health checks, and job cleanup to ensure high uptime and reliability.
  • Manage security and access control (LDAP / SSSD, VPN, PAM, SSH session auditing).

User & Stakeholder Support

  • Assist cluster users with developing workflows that make efficient use of compute resources.
  • Containerize HPC applications with Docker / Podman / Enroot-Pyxis and integrate GPU-aware runtimes into Slurm jobs.
  • Automate cost accounting and cluster usage reporting.

Qualifications

  • 7+ years of experience in HPC cluster administration and engineering, with deep knowledge of Slurm.
  • Familiarity with common AI / ML software package dependencies and workflows.
  • Expert in Slurm configuration, partition design, QoS / preemption policies, and GRES GPU scheduling.
  • Strong background in Linux system administration, networking, and performance tuning for HPC environments.
  • Hands-on experience with parallel file systems, advanced networking (InfiniBand, RoCE, 100 / 200 GbE), and monitoring stacks.
  • Proficient with automation tools (Ansible, Terraform, CI / CD pipelines) and version control.
  • Demonstrated ability to operate GPU-accelerated clusters at scale.