Talent.com

Reliability engineer Jobs in Oakland, CA

Last updated: 22 hours ago

Senior Site Reliability Engineer

TechChain Talent
San Francisco, California, United States
Full-time

We're seeking a Site Reliability Engineer to ensure Hyperbolic's GPU marketplace and AI infrastructure operate with exceptional reliability, performance, and security. You'll define and maintain SLO...

Reliability Manager

Georgia-Pacific
San Leandro, US
Full-time

Georgia-Pacific, LLC is now hiring a Reliability Manager for our Corrugated facility located in San Leandro, CA. The ideal candidate will be a self-driven individual with a passion for excellence in...

Software Engineer

Rainier Recruiting
San Francisco, CA, US
Full-time

Location: 100% remote, anywhere in the US, but observing 10am-3pm PST working hours. Compensation: $140,000 plus bonus and equity. We’re looking for a mid-level Software Engineer to take ownership of...

Lead Reliability Engineer

Safeway
San Leandro, CA, United States
Full-time

Are you ready to take the next step in your career? Join us for an exciting opportunity at Albertsons Companies, where innovation and customer service go hand-in-hand! At Albertsons Companies, we a...

Staff Mechanical Engineer, Spacecraft Engineer

Motive Companies
Alameda, CA, US
Permanent

We are seeking a Staff Mechanical Engineer to lead the mechanical design and manufacturability of next-generation Hall-effect thrusters for spacecraft propulsion. This position is critical in ensuri...

Chief Engineer

Midas Hospitality
Oakland, CA, USA
Full-time

Chief Engineer - Courtyard by Marriott Oakland Airport. Midas Hospitality is recognized as one of the Top 100 U.S. Employers in 2021 (by MogulRecruiter). Ranking #30 for talent, #13 for diversity, #33 f...

Sr Principal Site Reliability Engineer

Disney Entertainment and ESPN Product & Technology
San Francisco, California, United States
Full-time

P5/P6: SRE Lead, Content Distribution Engineering. On any given day at Disney Entertainment & ESPN Technology, we’re reimagining ways to create magical viewing experiences for the world’s most belov...

Reliability/Test Technician

SoloPoint Solutions, Inc.
San Francisco, CA, US
Full-time

Associate's degree in Electrical Engineering Technology, Electronics Technology, or a related technical field; or equivalent practical experience. Hands-on experience with environmental testing (tem...

Senior Director - Reliability Operations

816 GPS Services, Inc.
Folsom, SF
Full-time

The Senior Director - Reliability Operations is a strategic leader accountable for ensuring the reliability, availability, and performance of the enterprise technology ecosystem. This role oversees...

Senior Manager, DevOps & SRE – Platform Reliability & Global Operations

Qcells
San Francisco, CA, US
Full-time

The Senior DevOps & SRE Manager – Platform Reliability & Global Operations is a senior technical leader responsible for the reliability, scalability, security, and operational excellence of a compl...

System Engineer

Ouster
San Francisco, CA, US
Full-time

At Ouster, we design and manufacture LIDAR sensors for precision mapping, robotics, automotive, security systems, smart cities and various industrial solutions. We've transformed LIDAR from an analo...

Mechanical Engineer – Onshore Reliability

Hudson Manpower
San Francisco, CA, US
Full-time

Mechanical Engineer – Onshore Reliability. Bachelor’s Degree in Mechanical Engineering. Oil & Gas / Refinery (Onshore). The Mechanical Engineer – Onshore Reliability will be responsible for improving...

AI FullStack Engineer / Founding Engineer

Career Mentors, LLC
San Francisco, CA, US
Full-time

AI Full-Stack Engineer / Founding Engineer. Base Salary: $100,000 – $140,000. Hangzhou / Silicon Valley / Remote (Flexible). We’re not hiring someone to just execute tasks. Founding-level AI Full-Stack...

Analytics Engineer

StateHouse Holdings Inc.
San Francisco, CA, US
Full-time

The Analytics Engineer is responsible for designing, building, and maintaining scalable analytics data models and business intelligence solutions that enable data-driven decision-making across the...

Software Engineer

JT4
San Francisco, CA
Full-time

Benefield Anechoic Facility (BAF) at. The primary role involves supporting the RF Software Systems Element. The candidate's main responsibilities include providing engineering expertise and support t...

Associate Engineer - Software Engineer

Maximus
San Francisco, US
Full-time

Essential Duties and Responsibilities: - Design systems and programs to meet complex business needs. Code, test, debug, implement, and document moderately complex software programs. Prepare detailed...

Project Engineer

Sherwood Design Engineers
San Francisco, CA, US
Full-time

Project Engineer, CA. About Sherwood: Sherwood is a civil and environmental engineering firm that is committed to investing in and embracing people, communities and the environment. Our team has delive...

Founding Engineer

hireVouch
San Francisco, California, USA
Full-time

We’re hiring a Founding Full‑Stack Engineer to build V1 products end to end. You’ll work in a fast, hands‑on environment, iterating quickly with customers to solve real clinical workflow problems. Ow...

Site Reliability Engineer

IKR Enterprises
San Francisco, CA, United States
Full-time

San Francisco, CA or New York, NY | Full-time | In-Office. This clinical AI company works with dozens of the nation's leading health systems and helps millions of patients annually get faster access...

Senior Site Reliability Engineer

TechChain Talent
San Francisco, California, United States
30+ days ago
Job type
  • Full-time
Job description

About the Role

We're seeking a Site Reliability Engineer to ensure Hyperbolic's GPU marketplace and AI infrastructure operate with exceptional reliability, performance, and security. You'll define and maintain SLOs, build incident response systems, manage capacity across our distributed GPU network, and implement secure rollout/rollback mechanisms.

Requirements

- Experience in site reliability engineering, including working with SLOs and SLAs for production systems

- Experience with capacity planning and resource management for distributed systems

- Experience with incident response, on-call rotations, and post-mortem processes

- Experience with deployment systems (e.g., canary deployments, feature flags, automated rollbacks)

- Experience with observability tools (e.g., Prometheus, Grafana, ELK stack, logging, tracing, alerting)

- Experience with infrastructure security (e.g., network segmentation, workload isolation, security hardening)

- Experience with secrets management and key management systems (KMS)

- Experience with compliance frameworks (e.g., SOC 2, ISO 27001)

- Experience debugging distributed systems

- Experience with infrastructure-as-code, configuration management, and CI/CD pipelines

Bonus Skills

- Experience operating GPU infrastructure, AI/ML platforms, or compute marketplaces at scale

- Knowledge of multi-tenancy security patterns, container security, and runtime security tools

- Experience with chaos engineering, fault injection, and resilience testing

- Experience building and operating systems with 99.9%+ SLA uptime requirements