LLM Training Frameworks and Optimization Engineer

Together AI • San Francisco, CA, United States

Role

At Together AI, we are building cutting-edge infrastructure to enable efficient and scalable training of large language models (LLMs). We focus on optimizing training frameworks, algorithms, and infrastructure to push the boundaries of AI performance, scalability, and cost-efficiency.

We are seeking an LLM Training Frameworks and Optimization Engineer to drive innovations in the development and optimization of distributed training frameworks. In this role, you will ensure that our LLM training pipelines are robust, efficient, and capable of handling the complexities of large-scale distributed systems.

Responsibilities

Framework Development and Optimization:
Design, implement, and optimize distributed training frameworks tailored for large language models.
Develop custom modules, plugins, and features to enhance framework scalability and performance.
Algorithmic and Systems Optimization:
Optimize communication patterns (e.g., gradient synchronization, all-reduce) in distributed training.
Implement techniques like mixed precision, tensor parallelism, pipeline parallelism, and sharded training.
Performance Tuning:
Conduct in-depth profiling and debugging of training jobs to identify and resolve bottlenecks.
Collaborate with hardware teams to optimize performance for GPUs, TPUs, and other accelerators.
Scalability and Resilience:
Ensure training systems scale efficiently to thousands of nodes and petabytes of data.
Develop resilience mechanisms for fault-tolerant and checkpointed training pipelines.
Collaboration and Support:
Work closely with researchers, data engineers, and platform teams to ensure training frameworks meet model and workload requirements.
Provide guidance and tools to improve the overall efficiency of the LLM development lifecycle.

Requirements

Must-Have:

Experience:

5+ years of experience in deep learning frameworks, distributed systems, or machine learning infrastructure.

Technical Skills:

Expertise in distributed training frameworks (e.g., PyTorch DDP, DeepSpeed, Megatron-LM, TensorFlow XLA).

Strong understanding of parallelism techniques (e.g., data, tensor, pipeline, and ZeRO-based parallelism).

Familiarity with GPU/TPU hardware and deep learning performance optimizations.

Programming:

Proficient in Python and C++ or CUDA for high-performance computing.

Optimization Techniques:

Experience with memory optimization techniques (e.g., activation checkpointing, gradient sharding).

Knowledge of training dynamics for large-scale LLMs, including hyperparameter tuning and optimization.

Soft Skills:

Analytical problem-solving skills and a focus on performance improvement.

Strong collaboration and communication skills across teams.

Nice-to-Have

Familiarity with graph optimization and compiler-level performance tuning.

Contributions to open-source deep learning or distributed training projects.

Experience with low-level hardware optimizations (e.g., kernel fusion, custom CUDA kernels).

About Together AI

Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed to leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancements such as FlashAttention, Hyena, FlexGen, and RedPajama. We invite you to join a passionate group of researchers on our journey to build the next generation of AI infrastructure.

Compensation

We offer competitive compensation, startup equity, health insurance, and other competitive benefits. The US base salary range for this full-time position is: $160,000 - $230,000 + equity + benefits. Our salary ranges are determined by location, level, and role. Individual compensation will be determined by experience, skills, and job-related knowledge.

Equal Opportunity

Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more.

Please see our privacy policy at https://www.together.ai/privacy

Seniority level

Mid-Senior level

Employment type

Full-time

Job function

Engineering and Information Technology

Industries

Software Development
