Member of Technical Staff: ML Infrastructure, Platform Engineer

Essential AI • San Francisco, CA, US

Posted 30 days ago

Job type: Full-time

Job Description

About Us

Essential AI is building an open platform to fuel and accelerate AI breakthroughs globally. Our open models, robust tooling, reproducible pipelines, and evaluation frameworks are designed for collaboration and contribution, empowering others to build, iterate, and innovate faster.

Essential AI's technology and products have the means to shape AI advancements while supporting scalable and sustainable business models. Powerful AIs don't trace their origins to singular breakthroughs, but to an amalgam of improvements, incremental and large. Essential AI creates the ideal environment to catalyze these advancements, enabling a steady path to sustained frontier capabilities.

The Role

The ML Infra Platform Engineer will be responsible for architecting and building the compute infrastructure that powers the training and serving of our models. This requires a full understanding of the complete backend stack, from frameworks to compilers to runtimes to kernels.

Running and training models at scale often requires solving novel systems problems. As an ML Infra Platform Engineer, you'll be responsible for identifying these problems and then developing systems that optimize the throughput and robustness of distributed systems. With proven experience building large-scale platforms, you will build and advance the systems that allow research and engineering organizations to iteratively develop, test, and deploy new features reliably, with high velocity, and with a fast, frictionless development cycle.

What you’ll be working on

You will help oversee and drive the vision for how we build, test, and deploy models, while taking ownership of and transforming the state-of-the-art development experience for research

Design, build, and maintain scalable machine learning infrastructure to support our model training, inference, and applications

Design and implement scalable machine learning and distributed systems that enable training and scaling of LLMs. Work on parallelism methods to improve training in a fast and reliable way

Work on lower levels of the stack to build high-performance training and serving infrastructure, including researching new techniques and writing custom kernels as needed to achieve improvements

Develop tools and frameworks to automate and streamline ML experimentation and management

Collaborate with other researchers and product engineers to bring magical product experiences to life through large language models

Be willing to optimize performance and efficiency across different accelerators

What we are looking for

A strong understanding of the architectures of new AI accelerators such as GPUs, TPUs, IPUs, and HPUs, and their tradeoffs. Knowledge of parallel computing concepts and distributed systems.

Experience with kernels, low-precision training, and mixture-of-experts (MoE) models.

Prior experience in performance tuning of LLM training and/or inference workloads. Experience with MLPerf or internal production workloads will be valued.

6+ years of relevant industry experience leading the design of large-scale, production ML infra systems. Experience with communication libraries.

Experience with training and building large language models using frameworks such as Megatron and DeepSpeed, and deployment frameworks like vLLM, TGI, and TensorRT-LLM

Comfortable working under the hood with kernel languages like OpenAI Triton and Pallas, and compilers like XLA

Experience with INT8/FP8 training and inference, quantization, and/or distillation

Knowledge of container technologies like Docker and Kubernetes and cloud platforms like AWS, GCP, etc.

Intermediate fluency with networking fundamentals like VPCs, subnets, routing tables, and firewalls

We encourage you to apply for this position even if you don’t meet all of the above requirements but want to spend time pushing on these techniques.

Essential AI commits to providing a work environment free of discrimination and harassment, as well as equal employment opportunity regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity or veteran status. We also consider qualified applicants regardless of criminal histories, consistent with legal requirements. You may view all of Essential AI’s recruiting notices here, including our EEO policy, recruitment scam notice, and recruitment agency policy.

