Principal Engineer - Performance AI / ML Network Deployment Engineering

Advanced Micro Devices • Santa Clara, CA, United States

WHAT YOU DO AT AMD CHANGES EVERYTHING

At AMD, our mission is to build great products that accelerate next-generation computing experiences—from AI and data centers, to PCs, gaming and embedded systems. Grounded in a culture of innovation and collaboration, we believe real progress comes from bold ideas, human ingenuity and a shared passion to create something extraordinary. When you join AMD, you’ll discover the real differentiator is our culture. We push the limits of innovation to solve the world’s most important challenges—striving for execution excellence, while being direct, humble, collaborative, and inclusive of diverse perspectives. Join us as we shape the future of AI and beyond. Together, we advance your career.

THE ROLE

The Principal Engineer DC GPU AI / ML Advanced Forward Deployment and Systems Engineering is a leadership position designed to optimize the design, roll-out, and post-rollout management of AI / ML fabrics. The candidate will be the technical interface between customers and various internal engineering groups and field application engineers. Leveraging extensive experience in large network architecture, storage, AI / ML network deployments, and performance tuning, this role requires a disciplined approach to system triage, at-scale debug, and infrastructure optimization to ensure robust performance and efficient transitions from GPU production qualification to at-scale datacenter deployment.

THE PERSON

This position is for a Principal Engineer DC GPU AI / ML Advanced Forward Deployment and Systems Engineering with a focus on architecture, design, optimization of compute, network, and storage, and benchmarking of machine learning applications. You will be part of a team that works closely with strategic customers and partners to enable large-scale deployment of AMD CPU and GPU platforms. You will interface closely with ROCm software developers, DC GPU HW / FW / ASIC teams, field engineering teams, OEM / ODM partners, CSPs, and marketing / business development teams. You must be self-motivated and able to work well within a team environment.

KEY RESPONSIBILITIES

Collaborate with strategic customers on scalable designs spanning compute, networking, and storage environments, and work with industry partners and internal teams to accelerate the deployment and adoption of AI / ML models.
Engage in system-level triage and at-scale debug of complex issues across hardware, firmware, and software, ensuring rapid resolution and system reliability.
Drive the ramp of Instinct-based large-scale AI datacenter infrastructure based on NPI base platform hardware with ROCm, scaling up to pod and cluster level, leveraging the best in network architecture for AI / ML workloads.
Enhance tools and methodologies for large-scale deployments to meet customer uptime goals and exceed performance expectations.
Engage with clients to deeply understand their technical needs, ensuring their satisfaction with tailored solutions that leverage your past experience in strategic customer engagements and architectural wins.
Provide domain-specific knowledge to other groups at AMD and share lessons learned to drive continuous improvement.
Engage with AMD product groups to drive resolution of application and customer issues.
Develop and present training materials to internal audiences, at customer venues, and at industry conferences.

PREFERRED EXPERIENCE

Expertise in networking and performance optimization for large-scale AI / ML networks, including network, compute, and storage cluster design, modelling, analytics, performance tuning, convergence, and scalability improvements.

Candidates with solid, hands-on expertise in at least one of three domains (compute, network, storage) are preferred.

Demonstrated leadership in network architecture, with hands-on experience in RoCEv2 design, VXLAN-EVPN, BGP, and lossless fabrics.

Deep experience working with large customers such as cloud service providers and global enterprise customers.

Proven leadership in engaging customers across diverse technical disciplines through avenues such as proofs of concept, competitive evaluations, and early field trials.

Direct experience working with large customers, with the ability to operate with a sense of urgency, own problems, and resolve them.

Extensive experience in Python, Linux, kernel modules, and application libraries, unless accompanied by other skill sets in the space.

Proven ability to influence design and technology roadmaps, leveraging a deep understanding of datacenter products and market trends.

Extensive hands-on network deployment expertise and a proven track record of delivering large projects on time. Cisco, Juniper, or Arista experience is required.

Direct co-development / deployment experience working with strategic customers / partners to bring solutions to market.

Excellent communication skills with audiences ranging from engineers to mid-level management to C-level executives.

This is a senior-level role; no recent college graduates will be considered.

ACADEMIC CREDENTIALS

Bachelor's or master's degree in computer science, engineering, or a related field.

Ability to work well in a geographically dispersed team.

Certifications in Networking, AI / ML, or Cloud Technologies.

Benefits offered are described in: AMD benefits at a glance.

Equal Opportunity Statement

AMD does not accept unsolicited resumes from headhunters, recruitment agencies, or fee-based recruitment services. AMD and its subsidiaries are equal opportunity, inclusive employers and will consider all applicants without regard to age, ancestry, color, marital status, medical condition, mental or physical disability, national origin, race, religion, political and / or third-party affiliation, sex, pregnancy, sexual orientation, gender identity, military or veteran status, or any other characteristic protected by law. We encourage applications from all qualified candidates and will accommodate applicants’ needs under the respective laws throughout all stages of the recruitment and selection process.
