Staff ML Platform Engineer – Large Scale Training (LLMOps / MLOps)

Socotra, Inc. • San Francisco, CA, United States

Posted 30 days ago

Job type: Full-time

Job description

Build the Future of Scalable AI at TrueFoundry

At TrueFoundry, we’re redefining how ML teams train, deploy, and scale their models. Our LLMOps and MLOps platform empowers organizations to experiment faster, train large-scale models reliably, and deploy them seamlessly on Kubernetes, with the same muscle as Big Tech.

We're looking for ML Systems Engineers who are passionate about scaling deep learning workloads, optimizing multi-GPU training, and shipping production-grade solutions. If you live and breathe PyTorch and multi-node training, and love solving gnarly infra challenges, this is your place.

What You’ll Work On

Write clean, modular, and scalable Python code, with a strong emphasis on reliability and performance.
Build a platform for training and fine-tuning large-scale ML models across multi-GPU, multi-node clusters with PyTorch, Kubeflow, and other orchestration tools.
Own the infrastructure and code that enables high-throughput, low-latency inference pipelines for state-of-the-art models.
Build a platform for developing, deploying, and evaluating agentic applications for our end customers.
Help shape internal standards and best practices across the engineering team for high-scale ML workloads.

What We’re Looking For

5+ years of hands-on experience building and deploying ML systems at scale.

5+ years of writing production-quality, high-performance code.

Deep experience with multi-GPU / multi-node training, ideally with PyTorch as your primary framework.

Experience working with PyTorch, high-level ML frameworks, and inference engines (vLLM or TensorRT).

Experience with Kubernetes is highly preferred; exposure to Kubernetes-native tools is a huge plus.

A pragmatic mindset—you know when to optimize and when to ship.

Bonus: Familiarity with open-source LLM training / fine-tuning.

Why Join TrueFoundry?

Work directly with ex-Facebook engineers and founders from IIT Kharagpur, UC Berkeley, and Y Combinator alumni.

First-hand exposure to building and scaling a deep-tech startup, with insights you’ll carry if you want to start your own one day.

Be part of a fearlessly experimental culture focused on customer success and long-term impact.

Flexible hours, learning credits, and the opportunity to work shoulder-to-shoulder with the co-founders (Abhishek & Nikunj).
