Senior Site Reliability Engineer GPU Infrastructure
Genmo • San Francisco, California, United States
Job type: Full-time

We are Genmo, a research lab dedicated to building open, state-of-the-art models for video generation towards unlocking the right brain of AGI. Join us in shaping the future of AI and pushing the boundaries of what's possible in video generation.

What You’ll Do

Own the design and day‑to‑day operation of GPU clusters that train and serve frontier generative models.

Lead production Kubernetes operations: GPU scheduling, cluster upgrades, multi‑cluster federation.

Define and implement Infrastructure‑as‑Code (Terraform, Helm, Ansible) and GitOps workflows with Argo CD or Flux.

Build CI/CD pipelines, automated testing, and rollout strategies for infrastructure changes.

Develop an observability stack (Prometheus, Grafana, OpenTelemetry, eBPF) plus GPU telemetry with NVIDIA DCGM.

Optimize high‑performance networking (InfiniBand/RDMA) and debug performance bottlenecks.

Run and continuously improve the 24×7 on‑call rotation; lead post‑incident reviews.

Partner with researchers and engineers, communicate crisply, and ship with a high‑ownership mindset.

Minimum Qualifications

BS/MS/PhD in CS, EE, or a related field.

3+ years of SRE/DevOps experience in production; 2+ years managing large Kubernetes fleets.

Expert‑level Kubernetes experience.

Proficient in Python, Bash, and IaC tools (Terraform, Helm, Ansible).

Track record of shipping and operating large‑scale infrastructure with high reliability and clear communication.

Nice to Have

Multi‑cluster/multi‑cloud (AWS, GCP, Azure, bare‑metal) production experience.

Hands‑on with containerized GPU stacks (nvidia-container-toolkit, GPU Operator).

GPU schedulers such as Slurm or Kueue.

Familiarity with CI/CD tooling (GitHub Actions, BuildKit).

Prior work with distributed training, model‑serving patterns, or other ML/GPU workloads.

Machine‑learning depth is a plus—not a prerequisite. We’ll help you level up if needed.

Genmo is an Equal Opportunity Employer. Candidates are evaluated without regard to age, race, color, religion, sex, disability, national origin, sexual orientation, veteran status, or any other characteristic protected by federal or state law. Genmo, Inc. is an E-Verify company, and you may review the Notice of E-Verify Participation and the Right to Work posters in English and Spanish.
