Principal/Senior GPU Software Performance Engineer — Training at Scale
AMD • San Jose, CA, United States
Full-time

Base Pay Range

$226,400 to $339,600 per year

What You Do at AMD Changes Everything

At AMD, our mission is to build great products that accelerate next‑generation computing experiences—from AI and data centers, to PCs, gaming and embedded systems. Grounded in a culture of innovation and collaboration, we believe real progress comes from bold ideas, human ingenuity and a shared passion to create something extraordinary. When you join AMD, you’ll discover the real differentiator is our culture. We push the limits of innovation to solve the world’s most important challenges—striving for execution excellence, while being direct, humble, collaborative, and inclusive of diverse perspectives. Join us as we shape the future of AI and beyond. Together, we advance your career.

The Role

We train large models across multi‑GPU clusters. Your charter is to make training materially faster and cheaper by leading performance engineering from math kernels and fused epilogues up to cluster‑level throughput, in partnership with researchers, framework teams, and infrastructure teams.

Key Responsibilities

  • Own kernel performance: Design, implement, and land high‑impact HIP/C++ kernels (e.g., attention, layernorm, softmax, GEMM/epilogues, fused pointwise) that are wave‑size portable and optimized for LDS, caches, and MFMA units.
  • Lead profiling & tuning: Build repeatable workflows with timelines, hardware counters, and roofline analysis; remove memory bottlenecks; tune launch geometry and occupancy; validate speedups with A/B harnesses.
  • Drive fusion & algorithmic improvements: Identify profitable fusions, tiling strategies, vectorized I/O, shared‑memory/scratchpad layouts, asynchronous pipelines, and warp/wave‑level collectives, while maintaining numerical stability.
  • Influence frameworks & libraries: Upstream or extend performance‑critical ops in PyTorch/JAX/XLA/Triton; evaluate and integrate vendor math libraries; guide compiler/codegen choices for target architectures.
  • Scale beyond one GPU: Optimize P2P and collective communication, overlap compute with communication, and improve data/pipeline/tensor parallelism throughput across nodes.
  • Benchmarking & SLOs: Define and own KPIs (throughput, time‑to‑train, $/step, energy/step); maintain dashboards, performance CI gates, and regression triage.
  • Technical leadership: Mentor senior engineers, set coding and performance standards, lead performance “war rooms,” and partner with silicon/vendor teams on microarchitecture‑aware optimizations.
  • Quality & reliability: Build reproducible performance harnesses, deterministic test modes, and documentation/playbooks so improvements persist release‑over‑release.

Preferred Experience

  • Experience in systems/HPC/ML performance engineering, with hands‑on GPU kernel work and shipped optimizations in production training or HPC.
  • Expert in modern C++ (C++17+) and at least one GPU programming model (CUDA, HIP, or SYCL/oneAPI) or a GPU kernel DSL (e.g., Triton); comfortable with templates, memory qualifiers, atomics, and warp/wave‑level collectives.
  • Deep understanding of GPU microarchitecture: SIMT execution, occupancy vs. register/scratchpad pressure, memory hierarchy (global/L2/shared or LDS), coalescing, bank conflicts, vectorization, and instruction‑level parallelism.
  • Proficiency with profiling & analysis: timelines and counters (e.g., Nsight Systems/Compute, rocprof/Omniperf, VTune/GPA, or equivalents), ISA/disassembly inspection, and correlating metrics to code changes.
  • Proven track record reducing time‑to‑train or $‑per‑step via kernel and collective‑comms optimizations on multi‑GPU clusters.
  • Strong Linux fundamentals (perf/eBPF, NUMA, PCIe/links), build systems (CMake/Bazel), Python, and containerized development (Docker/Podman).
  • Experience with distributed training (PyTorch DDP/FSDP, ZeRO/DeepSpeed, or JAX) and GPU collectives.
  • Expertise in mixed precision (BF16/FP16/FP8), numerics, and stability/accuracy validation at kernel boundaries.
  • Background in compiler/IR (LLVM/MLIR) or codegen for GPU backends; ability to guide optimization passes with performance goals.
  • Hands‑on with cluster orchestration (Slurm/Kubernetes), IB/RDMA tuning, and compute/communication overlap strategies.
Academic Credentials

  • Master's degree in Computer Science, Computer Engineering, Electrical Engineering, or equivalent.
Location

San Jose, CA

Benefits offered are described in AMD benefits at a glance.

AMD does not accept unsolicited resumes from headhunters, recruitment agencies, or fee‑based recruitment services. AMD and its subsidiaries are equal opportunity, inclusive employers and will consider all applicants without regard to age, ancestry, color, marital status, medical condition, mental or physical disability, national origin, race, religion, political and/or third‑party affiliation, sex, pregnancy, sexual orientation, gender identity, military or veteran status, or any other characteristic protected by law. We encourage applications from all qualified candidates and will accommodate applicants’ needs under the respective laws throughout all stages of the recruitment and selection process.
