Site Reliability Engineer

Rethink recruit • San Francisco, CA, United States

Posted 30 days ago

Job type: Full-time

Job description

About Runloop

Runloop is building the foundational infrastructure for the next generation of AI development. We provide AI engineers and data scientists with lightning-fast, secure, and reproducible code sandboxes. Our platform eliminates friction in environment setup and dependencies, enabling teams to experiment, iterate, and deploy seamlessly. We’re a small but dedicated team working to deliver a rock-solid platform that empowers innovation.

The Role

We’re looking for a skilled Site Reliability Engineer (SRE) to ensure the reliability, observability, performance, and security of our core platform—the foundation upon which our users build. You’ll work closely with engineering to maintain resilient systems that power our code sandboxes, while mentoring peers on reliability practices. This role blends deep operational expertise with a software engineering mindset.

What You’ll Do

Design, operate, and improve production infrastructure on AWS, GCP, or Azure.
Define and monitor SLIs / SLOs, manage error budgets, and maintain observability with Prometheus, Grafana, and logging / tracing frameworks.
Build automation for deployments, scaling, and recovery—reducing toil and creating self-healing systems.
Lead incident response, root‑cause analysis, and blameless post‑mortems.
Collaborate with developers to design scalable, reliable services.
Optimize distributed systems, networking, and sandbox performance.
Plan for capacity growth and support safe release / change management.
Mentor engineers on reliability and front‑end distributed systems (CDNs, caching, client observability).

Qualifications

Proven experience as an SRE, DevOps Engineer, or similar role.

Strong programming skills (Python or Go preferred).

Deep knowledge of containerization (Docker, Kubernetes).

Expertise in infrastructure-as-code (Terraform or Pulumi).

Strong understanding of networking, Linux, and system security.

Hands‑on experience with distributed systems and observability (metrics, logs, tracing).

Skilled in incident management, on‑call rotations, and post‑mortem processes.

Ability to mentor and influence best practices across teams.

Bonus Points

Experience with chaos engineering, CI / CD for front-end delivery, or observability tooling such as Sentry, real user monitoring (RUM), or synthetic monitoring.

Benefits

Competitive salary and equity.

Comprehensive health, dental, and vision insurance for you and your dependents.

Free lunch and snacks.

Opportunity to shape the future of AI‑driven software engineering in a high‑impact role.

Location

On‑site in San Francisco, CA (in office 4 days / week, optional 1 day WFH).

Join Us

If you’re passionate about building resilient systems that empower developers and want to shape the future of AI‑driven software engineering, we’d love to hear from you. Join Runloop and help build the infrastructure that powers tomorrow’s AI.

Runloop is an Equal Opportunity Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, disability status, protected veteran status, sexual orientation, gender identity, or any other characteristic protected by law.
