Talent.com
Principal DevOps Engineer - ML / AI Algorithms

F. Hoffmann-La Roche Gruppe • Pleasanton, CA, US
Full-time

At Roche you can show up as yourself, embraced for the unique qualities you bring. Our culture encourages personal expression, open dialogue, and genuine connections, where you are valued, accepted and respected for who you are, allowing you to thrive both personally and professionally. This is how we aim to prevent, stop and cure diseases and ensure everyone has access to healthcare today and for generations to come. Join Roche, where every voice matters.

The Position

Principal DevOps Engineer - ML / AI Algorithms

Developing software is great, but developing software with a purpose is even better! As a Principal DevOps Engineer - ML / AI Algorithms, you will work on products that help people with the most precious thing they have — their health. You will be part of the RIS Research & Development team contributing to digital health products touching Imaging, ML / AI, and computational science.

The Opportunity

As Principal DevOps Engineer, you will collaborate with important stakeholders on the development of the build, release, and deploy toolchain for DevOps, paving the way for seamless and efficient software delivery processes.

Location

This role can be based in Santa Clara (primary location) or in secondary locations (Mississauga, Canada or Basel, Switzerland).

Key Responsibilities

Lead the initiative to set up, manage, and meticulously maintain parity across development, staging, and production application environments in cutting-edge cloud infrastructure, ensuring a robust and consistent deployment pipeline.

Champion the development and implementation of advanced monitoring infrastructure, giving the team real-time insights and ensuring the highest levels of system reliability and performance.

Provide dedicated on-call support for production operations, ensuring the uninterrupted delivery of critical services and swift resolution of any operational issues.

Interface with software developers, product managers, test engineers and administrators on projects to design and develop the build, release, and deploy toolchain for DevOps while providing on-call support.

Identify, troubleshoot and resolve issues quickly and effectively, sometimes under pressure.

Take an active part in planning, high-availability engineering, performance tuning, and automation / tools development.

Manage multiple releases with focus on system reliability, scalability, and efficiency.

Implement and manage the full lifecycle of machine learning models, including versioning, deployment strategies (e.g., canary, A / B testing), monitoring for drift and performance, and decommissioning.

Bring leadership to improve DevOps technology and processes, and mentor other DevOps engineers on the team.

Who You Are

Bachelor's degree in Computer Science, Engineering, or a related field with at least 8 years of experience in a DevOps role, or an equivalent combination of education and experience to perform at this level.

8+ years of experience with container technology, including Kubernetes, AWS EKS, Helm Charts, Splunk, and Docker, along with provisioning infrastructure through IaC using Terraform and cloud automation principles.

Proficiency in Unix / Linux administration, shell scripting, and internals, with a preference for Ubuntu.

Deep working experience and extensive knowledge in building and deploying infrastructure using IaC frameworks such as Terraform and AWS CloudFormation / SAM.

Experience building and automating scalable data pipelines for ingesting, transforming, and versioning large-scale image datasets, including the use of distributed computing.

Familiarity with DevOps practices and proficiency in log analysis and monitoring tools are essential for effective troubleshooting and system optimization.

Proficiency in Python for automating production systems and in Git, GitLab, GitHub Actions, and GitHub CI / CD, plus familiarity with common ML libraries such as TensorFlow, PyTorch, and scikit-learn to understand the engineering needs of the ML models you will be deploying.

Strong working knowledge of AWS Cloud infrastructure, including EC2, S3, API Gateway, Kubernetes, RDS, VPC peering, Route 53, IAM, Batch, Lambda, AWS Config, and Auto Scaling.

Preferred

MLOps experience with demonstrated experience supporting machine learning or computer vision teams.

Deep experience with container orchestration for ML workloads using Kubernetes, including frameworks like Kubeflow or KubeRay to manage distributed training jobs.

Familiarity with data versioning tools like DVC.

Familiarity with common ML libraries such as TensorFlow, PyTorch, and scikit-learn to understand the engineering needs of the ML models.

Familiarity with other languages such as Java, R, and C / C++.

Experience with AWS services for machine learning, such as Amazon SageMaker, and experience managing GPU-accelerated compute instances (e.g., EC2 P and G series) for model training and inference.

The expected salary range for this position based on the primary location of Santa Clara, CA is between $162,600 and $302,000. Actual pay will be determined based on experience, qualifications, geographic location, and other job-related factors permitted by law. A discretionary annual bonus may be available based on individual and Company performance. This position also qualifies for the benefits detailed at the link provided below.

Benefits

Relocation benefits are not available for this position.

Who we are

A healthier future drives us to innovate. Together, more than 100'000 employees across the globe are dedicated to advancing science, ensuring everyone has access to healthcare today and for generations to come. Our efforts result in more than 26 million people treated with our medicines and over 30 billion tests conducted using our Diagnostics products. We empower each other to explore new possibilities, foster creativity, and keep our ambitions high, so we can deliver life-changing healthcare solutions that make a global impact.

Let's build a healthier future, together.

Roche is an equal opportunity employer. It is our policy and practice to employ, promote, and otherwise treat any and all employees and applicants on the basis of merit, qualifications, and competence. The company's policy prohibits unlawful discrimination, including but not limited to discrimination on the basis of protected veteran status or disability status, consistent with all federal, state, or local laws.

If you have a disability and need an accommodation in relation to the online application process, please contact us by completing the Accommodations for Applicants form.
