Talent.com
Network Engineer, AI/ML Infrastructure
Boson AI • Santa Clara, CA, US
Job type: Full-time

Job Description

About The Role

We're seeking an experienced Network Engineer to design, build, and optimize the high-performance networking infrastructure powering our AI/ML operations in Toronto. You'll work at the cutting edge of network technology, managing InfiniBand and ultra-high-speed Ethernet fabrics that connect NVIDIA H100 and A100 GPUs, over 20 PB of Ceph storage, and hundreds of servers.

You'll be hands-on with the full lifecycle of our network infrastructure: planning, building, testing, deploying, and keeping everything running at peak performance. That means troubleshooting issues as they arise, monitoring network performance and throughput, developing automation to streamline operations, and working closely with HPC and ML teams to ensure they have the bandwidth they need. You'll also help us plan for future capacity and evaluate emerging network technologies as we scale to meet increasingly demanding workloads.

Responsibilities

  • Configure and maintain InfiniBand and high-speed Ethernet fabrics
  • Optimize network performance for RDMA and GPU-to-GPU communication
  • Manage network switches (Mellanox, NVIDIA, Micas Networks)
  • Troubleshoot network bottlenecks and latency issues
  • Plan and execute network upgrades and expansions
  • Implement network security (firewalls, VLANs, ACLs)
  • Collaborate on storage network optimization
  • Monitor infrastructure

Minimum Qualifications

  • 4+ years of network engineering experience in production environments
  • Strong understanding of L2/L3 networking protocols (TCP/IP, BGP, OSPF, VLANs)
  • Hands-on experience with high-speed networking (100Gb+ Ethernet and InfiniBand)
  • Hands-on experience with network security (firewalls, ACLs, network segmentation)
  • Knowledge of HPC network topologies
  • Experience with InfiniBand fabrics and related technologies, including RDMA, RoCE, and IPoIB
  • Strong troubleshooting and problem-solving skills

Preferred Qualifications

  • Experience in data center environments or AI/ML infrastructure
  • Hands-on experience with high-performance Ethernet switches (e.g., Broadcom Tomahawk) and the latest InfiniBand switches (e.g., NVIDIA/Mellanox)
  • Experience optimizing networks for GPU-to-GPU communication
  • Experience with open-source firewall solutions (OPNsense, pfSense, or similar)
  • Experience with network automation tools
  • Understanding of distributed storage networking (Ceph cluster networks)
  • Familiarity with network monitoring and observability tools (Prometheus, Grafana)
  • Knowledge of multi-site network connectivity and WAN optimization
  • Familiarity with cloud networking on at least one platform (AWS, GCP, or Azure), including VPC design, site-to-site VPN configuration, Direct Connect/ExpressRoute/Cloud Interconnect, hybrid cloud connectivity, and cloud-to-datacenter network integration

If you're a natural problem-solver with a passion for continuous learning, we'd love to hear from you.

We may use artificial intelligence (AI) tools to support parts of the hiring process, such as reviewing applications, analyzing resumes, or assessing responses. These tools assist our recruitment team but do not replace human judgment. Final hiring decisions are ultimately made by humans. If you would like more information about how your data is processed, please contact us.
