Talent.com
Principal Engineer - AI Infrastructure Abstractions

Diversity Talent Scouts • San Jose, CA, US
Job type: Full-time

Job Description

As a Principal AI Infrastructure Abstraction Engineer, you will design and implement the foundational systems that make shared AI compute environments scalable, secure, and developer-friendly. Your work will focus on creating abstractions that hide hardware complexity while providing predictable, cloud-native interfaces for AI workloads.

This position bridges infrastructure and applied AI—turning raw GPUs and accelerators into programmable, elastic, and multi-tenant resources for both internal developers and enterprise clients.

Key Responsibilities

  • Architect abstractions that map logical compute constructs (vGPUs, GPU pools, workload queues) to physical devices.
  • Build APIs, services, and control planes that expose GPU and accelerator resources with strong isolation and quality-of-service guarantees.
  • Develop mechanisms for secure GPU sharing, including time-slicing, partitioning, and namespace isolation.
  • Work with orchestration and scheduling systems to ensure intelligent mapping of resources based on utilization, priority, and network topology.
  • Define policies for quotas, fair allocation, and resource elasticity in shared environments.
  • Integrate with AI/ML frameworks (PyTorch, TensorFlow, Triton, etc.) to optimize model training and inference workflows.
  • Deliver observability and monitoring capabilities that trace resource usage from logical abstractions to hardware.
  • Partner with platform security teams to strengthen access controls, onboarding processes, and tenant isolation.
  • Support internal developer adoption of abstraction APIs while maintaining high performance and low overhead.
  • Contribute to long-term compute platform strategy with a focus on modularity, abstraction, and scale.

Minimum Qualifications

  • Bachelor’s degree with 15+ years of experience, Master’s with 12+ years, or PhD with 8+ years.
  • Proven track record building production-grade infrastructure systems, preferably in Go, Python, or C++.
  • Strong experience with containerization and orchestration platforms (Kubernetes, Docker, KubeVirt).
  • Background in designing logical abstractions for compute, storage, or networking in multi-tenant systems.
  • Experience integrating with machine learning platforms (e.g., PyTorch, TensorFlow, Triton, MLflow).

Preferred Qualifications

  • Hands-on experience with GPU sharing, scheduling, or isolation (MIG, MPS, vGPUs, time-slicing, or device plugin models).
  • Deep knowledge of resource management: quotas, prioritization, fairness, elasticity.
  • Strong ability to think across hardware/software boundaries and design abstractions that scale.