Principal/Senior GPU Software Performance Engineer — Training at Scale
AMD • San Jose, CA, United States

Base Pay Range

$226,400.00/yr - $339,600.00/yr

What You Do at AMD Changes Everything

At AMD, our mission is to build great products that accelerate next‑generation computing experiences—from AI and data centers, to PCs, gaming and embedded systems. Grounded in a culture of innovation and collaboration, we believe real progress comes from bold ideas, human ingenuity and a shared passion to create something extraordinary. When you join AMD, you’ll discover the real differentiator is our culture. We push the limits of innovation to solve the world’s most important challenges—striving for execution excellence, while being direct, humble, collaborative, and inclusive of diverse perspectives. Join us as we shape the future of AI and beyond. Together, we advance your career.

The Role

We train large models across multi‑GPU clusters. Your charter is to make training materially faster and cheaper by leading kernel‑level performance engineering—from math kernels and fused epilogues to cluster‑level throughput—partnering with researchers, framework teams, and infrastructure.

Key Responsibilities

  • Own kernel performance: Design, implement, and land high‑impact HIP/C++ kernels (e.g., attention, layernorm, softmax, GEMM/epilogues, fused pointwise) that are wave‑size portable and optimized for LDS, caches, and MFMA units.
  • Lead profiling & tuning: Build repeatable workflows with timelines, hardware counters, and roofline analysis; remove memory bottlenecks; tune launch geometry/occupancy; validate speedups with A/B harnesses.
  • Drive fusion & algorithmic improvements: Identify profitable fusions, tiling strategies, vectorized I/O, shared‑memory/scratchpad layouts, asynchronous pipelines, and warp/wave‑level collectives, while maintaining numerical stability.
  • Influence frameworks & libraries: Upstream or extend performance‑critical ops in PyTorch/JAX/XLA/Triton; evaluate and integrate vendor math libraries; guide compiler/codegen choices for target architectures.
  • Scale beyond one GPU: Optimize P2P and collective comms, overlap compute/comm, and improve data/pipeline/tensor parallelism throughput across nodes.
  • Benchmarking & SLOs: Define and own KPIs (throughput, time‑to‑train, $/step, energy/step); maintain dashboards, perf CI gates, and regression triage.
  • Technical leadership: Mentor senior engineers, set coding/perf standards, lead performance “war rooms,” and partner with silicon/vendor teams on microarchitecture‑aware optimizations.
  • Quality & reliability: Build reproducible perf harnesses, deterministic test modes, and documentation/playbooks so improvements persist release‑over‑release.

Preferred Experience

  • Experience in systems/HPC/ML performance engineering, with hands‑on GPU kernel work and shipped optimizations in production training or HPC.
  • Expert in modern C++ (C++17+) and at least one GPU programming model (CUDA, HIP, or SYCL/oneAPI) or a GPU kernel DSL (e.g., Triton); comfortable with templates, memory qualifiers, atomics, and warp/wave‑level collectives.
  • Deep understanding of GPU microarchitecture: SIMT execution, occupancy vs. register/scratchpad pressure, memory hierarchy (global/L2/shared or LDS), coalescing, bank conflicts, vectorization, and instruction‑level parallelism.
  • Proficiency with profiling & analysis: timelines and counters (e.g., Nsight Systems/Compute, rocprof/Omniperf, VTune/GPA or equivalents), ISA/disassembly inspection, and correlating metrics to code changes.
  • Proven track record reducing time‑to‑train or $‑per‑step via kernel and collective‑comms optimizations on multi‑GPU clusters.
  • Strong Linux fundamentals (perf/eBPF, NUMA, PCIe/links), build systems (CMake/Bazel), Python, and containerized dev (Docker/Podman).
  • Experience with distributed training (PyTorch DDP/FSDP/ZeRO/DeepSpeed or JAX) and GPU collectives.
  • Expertise in mixed precision (BF16/FP16/FP8), numerics, and stability/accuracy validation at kernel boundaries.
  • Background in compiler/IR (LLVM/MLIR) or codegen for GPU backends; ability to guide optimization passes with performance goals.
  • Hands‑on with cluster orchestration (Slurm/Kubernetes), IB/RDMA tuning, and compute/communication overlap strategies.

Academic Credentials

  • Master's degree in Computer Science, Computer Engineering, Electrical Engineering, or equivalent.

Location

San Jose, CA

Benefits offered are described: AMD benefits at a glance.

AMD does not accept unsolicited resumes from headhunters, recruitment agencies, or fee‑based recruitment services. AMD and its subsidiaries are equal opportunity, inclusive employers and will consider all applicants without regard to age, ancestry, color, marital status, medical condition, mental or physical disability, national origin, race, religion, political and/or third‑party affiliation, sex, pregnancy, sexual orientation, gender identity, military or veteran status, or any other characteristic protected by law. We encourage applications from all qualified candidates and will accommodate applicants’ needs under the respective laws throughout all stages of the recruitment and selection process.
