Talent.com
Senior GPU Supercomputer Scheduler Engineer

NVIDIA • Santa Clara, CA, US
Job type: Full-time
Job description

NVIDIA has continuously reinvented itself over two decades. Our invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined modern computer graphics, and revolutionized parallel computing. More recently, GPU deep learning ignited modern AI — the next era of computing. NVIDIA is a “learning machine” that constantly evolves by adapting to new opportunities that are hard to solve, that only we can take on, and that matter to the world. This is our life’s work, to amplify human imagination and intelligence. Join us today!

As a member of the GPU/HPC Infrastructure team, you will provide leadership in the design and implementation of groundbreaking GPU compute clusters that run demanding deep learning, high performance computing, and computationally intensive workloads. We seek a technology leader to identify architectural changes and/or completely new approaches for improving HPC schedulers that serve many simultaneous and large multi-node GPU workloads with many complex dependencies. This role offers you an excellent opportunity to deliver production-grade solutions, get hands-on with groundbreaking technology, and work closely with technical leaders solving some of the biggest challenges in machine learning, cloud computing, and system co-design.

What you'll be doing:

Design and develop enhancements to the HPC batch scheduler(s).

Work extensively with the HPC scheduler vendor on bug fixes and feature releases.

Provide support to staff and end users to resolve batch scheduler issues.

Build and improve our ecosystem around GPU-accelerated computing

Analyze and optimize the performance of deep learning workflows.

Develop large-scale automation solutions.

Perform root cause analysis and suggest corrective action for problems at both large and small scales.

Find and fix problems before they occur.

What we need to see:

Bachelor’s degree in Computer Science, Electrical Engineering, or a related field (or equivalent experience), with 5+ years of work experience

Strong understanding of HPC batch schedulers, such as Slurm, RTDA, or LSF, and of HPC workflows that use MPI

Significant experience programming in C/C++ and advanced scripting in languages such as Python, Go, and Bash

Established experience with the Linux operating system, environment, and tools

Accomplished in computer architecture and operating systems

Deep knowledge of networking technologies such as InfiniBand and Ethernet

Experience analyzing and tuning performance for a variety of HPC workloads

In-depth understanding of container technologies such as Docker, Singularity, and Podman

Flexibility and adaptability for working in a dynamic environment with different frameworks and requirements

Excellent communication, interpersonal, and customer collaboration skills

Ways to stand out from the crowd:

Knowledge of MPI and high-performance computing

Background in RDMA technology

Experience in kernel programming

Open source software contributions

Experience with deep learning frameworks like PyTorch and TensorFlow

Passion for software development processes

A desire to make what was impossible possible!

The base salary range is 148,000 USD - 419,750 USD. Your base salary will be determined based on your location, experience, and the pay of employees in similar positions.

You will also be eligible for equity and benefits. NVIDIA accepts applications on an ongoing basis.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.
