Software Engineer, GPU Infrastructure - HPC

OpenAI • San Francisco, CA, United States

Posted 30 days ago

Job type: Full-time

Job description

About the team

The Fleet team at OpenAI supports the computing environment that powers our cutting-edge research and product development. We oversee large-scale systems that span data centers, GPUs, networking, and more, ensuring high availability, performance, and efficiency. Our work enables OpenAI's models to operate seamlessly at scale, supporting both internal research and external products like ChatGPT. We prioritize safety, reliability, and responsible AI deployment over unchecked growth.

About the role

As a software engineer on the Fleet High Performance Computing (HPC) team, you will be responsible for the reliability and uptime of OpenAI's entire compute fleet. Minimizing hardware failure is key to research training progress and stable services, as even a single hardware hiccup can cause significant disruptions. With increasingly large supercomputers, the stakes continue to rise.

Being at the forefront of technology means that we are often the pioneers in troubleshooting these state-of-the-art systems at scale. This is a unique opportunity to work with cutting-edge technologies and devise innovative solutions to maintain the health and efficiency of our supercomputing infrastructure.

Our team empowers strong engineers with a high degree of autonomy and ownership, as well as the ability to effect change. This role requires a keen focus on comprehensive system-level investigations and the development of automated solutions. We want people who go deep on problems, investigate as thoroughly as possible, and build automation for detection and remediation at scale.

In this role, you will

Build and maintain automation systems for provisioning and managing server fleets.

Develop tools to monitor server health, performance, and lifecycle events.

Collaborate with clusters, networking, and infrastructure teams.

Partner with external operators to ensure a high level of quality.

Identify and fix performance bottlenecks and inefficiencies.

Continuously improve automation to reduce manual work.

You might thrive in this role if you have

Experience managing large-scale server environments.

A balance of strengths in building and operationalizing.

Proficiency in Python, Go, or similar languages.

Strong Linux, networking, and server hardware knowledge.

Comfort digging into noisy data with SQL, PromQL, Pandas, or any other tool.

Prior hardware expertise is not required for this role.

Bonus Skills

Experience with low-level details of hardware components, protocols, and associated Linux tooling (e.g., PCIe, InfiniBand, networking, power management, kernel perf tuning).

Knowledge of hardware management protocols (e.g., IPMI, Redfish).

High-performance computing (HPC) or distributed systems experience.

Prior experience developing, managing, or designing hardware.

Familiarity with monitoring tools (e.g., Prometheus, Grafana).

About OpenAI

OpenAI is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity. We push the boundaries of the capabilities of AI systems and seek to safely deploy them to the world through our products. AI is an extremely powerful tool that must be created with safety and human needs at its core, and to achieve our mission, we must encompass and value the many different perspectives, voices, and experiences that form the full spectrum of humanity.

We are an equal opportunity employer, and we do not discriminate on the basis of race, religion, color, national origin, sex, sexual orientation, age, veteran status, disability, genetic information, or other applicable legally protected characteristic.

For additional information, please see OpenAI's Affirmative Action and Equal Employment Opportunity Policy Statement.

Qualified applicants with arrest or conviction records will be considered for employment in accordance with applicable law, including the San Francisco Fair Chance Ordinance, the Los Angeles County Fair Chance Ordinance for Employers, and the California Fair Chance Act. For unincorporated Los Angeles County workers: we reasonably believe that criminal history may have a direct, adverse and negative relationship with the following job duties, potentially resulting in the withdrawal of a conditional offer of employment: protect computer hardware entrusted to you from theft, loss or damage; return all computer hardware in your possession (including the data contained therein) upon termination of employment or end of assignment; and maintain the confidentiality of proprietary, confidential, and non-public information. In addition, job duties require access to secure and protected information technology systems and related data security obligations.

To notify OpenAI that you believe this job posting is non-compliant, please submit a report through this form. No response will be provided to inquiries unrelated to job posting compliance.

We are committed to providing reasonable accommodations to applicants with disabilities, and requests can be made via this link.

OpenAI Global Applicant Privacy Policy

At OpenAI, we believe artificial intelligence has the potential to help people solve immense global challenges, and we want the upside of AI to be widely shared. Join us in shaping the future of technology.

Compensation Range: $325K - $590K

