Talent.com
Infrastructure Engineering - Traffic
xAI • San Francisco, CA, United States
Job type: Full-time

Job description

xAI’s mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. Our team is small, highly motivated, and focused on engineering excellence. This organization is for individuals who appreciate challenging themselves and thrive on curiosity. We operate with a flat organizational structure. All employees are expected to be hands-on and to contribute directly to the company’s mission. Leadership is given to those who show initiative and consistently deliver excellence. Work ethic and strong prioritization skills are important. All engineers are expected to have strong communication skills. They should be able to concisely and accurately share knowledge with their teammates.

About the Role

In this role, you will be a key contributor to xAI’s Supercomputing team, focusing on building and optimizing scalable, high-performance traffic platforms that power our production inference engines. You will work on critical systems that manage traffic flow, service discovery, and network reliability across both on-premise and cloud-based Kubernetes clusters. Collaborating closely with Network Fabric Engineers and other technical teams, you will drive projects that enhance the stability and efficiency of our AI infrastructure, including support for large-scale training runs for advanced models like Grok 4 and beyond. This role demands deep technical expertise in Kubernetes, L4/L7 proxies like Envoy, and service discovery systems, along with a proactive approach to debugging and optimizing complex network performance issues from L3 to L7.

What you’ll do

  • Build and optimize traffic platforms that automate and simplify the lifecycle of production inference engines across dozens of on-premise and cloud clusters, managing core traffic primitives such as load balancing, routing, overload control, authentication/authorization, and encryption in transit.
  • Manage, extend, and optimize xAI’s production inference capabilities with L4/L7 proxies such as Envoy and NGINX.
  • Manage and extend xAI’s Service Discovery systems, both in and outside of Kubernetes (DNS, xDS control planes).
  • Collaborate with Network Fabric Engineers to improve host networking and fabric stability for large-scale training runs (e.g., Grok 4 and beyond).
  • Work with a fast, small technical team to execute projects in the critical path of xAI.

What we’d like to see

  • 2+ years of experience operating Kubernetes clusters, or experience writing and deploying controllers.
  • 2+ years of experience configuring and deploying Envoy, NGINX, HAProxy, or some other L7 software load balancer.
  • 1+ years of experience deploying and configuring Kubernetes CNI plugins (Calico, Cilium, Flannel) or experience with IPAM.
  • 1+ years of experience with DNS systems (e.g., CoreDNS, Unbound) or service discovery control planes (xDS).
  • 1+ years of experience with cloud networking primitives (VPC Route Tables, Cloud NAT, Peering/Transit Gateways, CDN, Cloudflare Workers, or equivalent).
  • Experience with host level network proxies (iptables, nftables, IPVS, eBPF programs) is a plus.
  • Deep experience with gRPC client libraries (grpcio/grpc-go/grpc-java) is a plus.
  • Experience with service mesh (Istio, Linkerd) is a plus.
  • Demonstrated experience working with Kubernetes and Envoy internals. Can you tell us how k8s cached clients work? Can you tell us how Envoy scales and manages state?
  • Demonstrated experience debugging performance and reliability issues that span L3 to L7. For example: how would a gRPC client in a cloud environment call a gRPC server on an on-prem server? Describe the entire network path and any issues to watch out for, including service discovery/DNS, gRPC channel management, egress proxies, VPC routing, peering/PNI, edge caching/CDN, L4 load-balancing devices, host networking and virtualization, k8s networking, L7 routing, TLS/authnz, and TCP/IP.
Location

This role is based in the Bay Area (San Francisco and Palo Alto). Candidates are expected to be located near the Bay Area or open to relocation.

Envoy/xDS

Golang and Rust

Interview Process

Application Review: Submit your CV and a statement of exceptional work. Our team will review your application to assess fit.

Phone Interview (45 minutes): A brief conversation with a team member to discuss your background, key accomplishments, and motivation.

Main Interview Process

  • 2 Coding Assessments: Solve problems in a language of your choice.
  • Systems Hands-On: Demonstrate practical skills in a live problem-solving session.
  • Project Deep-Dive: Present your past exceptional work to a small audience.
Annual Salary Range

$180,000 - $440,000 USD

Base salary is just one part of our total rewards package at xAI, which also includes equity, comprehensive medical, vision, and dental coverage, access to a 401(k) retirement plan, short & long-term disability insurance, life insurance, and various other discounts and perks.

Note

We welcome a variety of formats, such as public writings, presentations, or publications. Submission is optional but highly encouraged.
