Staff ML Platform Engineer – Large Scale Training (LLMOps / MLOps)

Socotra, Inc. • San Francisco, CA, United States

Posted 30 days ago

Job type: Full-time

Job description

Build the Future of Scalable AI at TrueFoundry

At TrueFoundry, we’re redefining how ML teams train, deploy, and scale their models. Our LLMOps and MLOps platform empowers organizations to experiment faster, train large-scale models reliably, and deploy them seamlessly on Kubernetes, with the same muscle as Big Tech.

We’re looking for ML Systems Engineers who are passionate about scaling deep learning workloads, optimizing multi-GPU training, and shipping production-grade solutions. If you live and breathe PyTorch and multi-node training, and love solving gnarly infra challenges, this is your place.

What You’ll Work On

Write clean, modular, and scalable Python code, with a strong emphasis on reliability and performance.
Build a platform for training and fine-tuning large-scale ML models across multi-GPU, multi-node clusters with PyTorch, Kubeflow, and other orchestration tools.
Own the infrastructure and code that enables high-throughput, low-latency inference pipelines for state-of-the-art models.
Build a platform for developing, deploying, and evaluating agentic applications for our end customers.
Help shape internal standards and best practices across the engineering team for high-scale ML workloads.

What We’re Looking For

5+ years of hands-on experience building and deploying ML systems at scale.

5+ years of writing production-quality, high-performance code.

Deep experience with multi-GPU / multi-node training, ideally with PyTorch as your primary framework.

Experience working with torch, high-level ML frameworks, and inference engines (vLLM or TensorRT).

Experience with Kubernetes is highly preferred; exposure to Kubernetes-native tools is a huge plus.

A pragmatic mindset—you know when to optimize and when to ship.

Bonus: Familiarity with open-source LLM training / fine-tuning.

Why Join TrueFoundry?

Work directly with ex-Facebook engineers and founders from IIT Kharagpur, UC Berkeley, and Y Combinator alumni.

First-hand exposure to building and scaling a deep-tech startup, with insights you’ll carry if you want to start your own one day.

Be part of a fearlessly experimental culture focused on customer success and long-term impact.

Flexible hours, learning credits, and the opportunity to work shoulder-to-shoulder with the co-founders (Abhishek & Nikunj).
