Principal Site Reliability Engineer

Crusoe Energy Systems LLC • San Francisco, California, United States

Job type: Full-time

Crusoe's mission is to accelerate the abundance of energy and intelligence. We’re crafting the engine that powers a world where people can create ambitiously with AI — without sacrificing scale, speed, or sustainability.

Be a part of the AI revolution with sustainable technology at Crusoe. Here, you'll drive meaningful innovation, make a tangible impact, and join a team that’s setting the pace for responsible, transformative cloud infrastructure.

About This Role

As a Principal Site Reliability Engineer, you will play a critical role in designing and operating a next-generation NeoCloud built for AI, GPU, and high-performance workloads. This role sits at the intersection of infrastructure architecture, reliability engineering, and technical leadership. You’ll set reliability strategy, influence platform design, and ensure the cloud scales safely, efficiently, and predictably as customer demand accelerates.

You are a hands‑on technical leader who thrives in complex distributed systems, drives clarity in ambiguous environments, and raises the bar for operational excellence across the organization.

What You’ll Be Working On

Define and own the reliability architecture for a NeoCloud platform supporting GPU‑dense, latency‑sensitive, and large‑scale distributed workloads

Design and evolve SLOs, SLIs, and error budgets that meaningfully balance reliability, velocity, and customer experience

Lead incident response strategy for high‑severity events, including root cause analysis and long‑term remediation

Architect and improve observability systems (metrics, logs, tracing) to support rapid detection and diagnosis at scale

Partner with Infrastructure, Networking, Hardware, and Platform teams to influence system design before production issues occur

Drive automation across provisioning, deployment, capacity management, and failure recovery

Establish best practices for on‑call health, operational readiness, and production change management

Serve as a technical authority and mentor for senior and staff‑level engineers across the SRE and infrastructure org

What You’ll Bring to the Team

10+ years of experience operating and scaling large‑scale distributed systems in production environments

Deep expertise in SRE principles: reliability modeling, incident management, toil reduction, and systems thinking

Strong background in cloud or infrastructure platforms (public cloud, private cloud, or NeoCloud environments)

Hands‑on experience with Kubernetes and containerized workloads at scale

Proficiency in one or more programming languages (Go, Python, Rust, or similar) with production‑grade code ownership

Strong understanding of Linux systems, networking fundamentals, and performance bottlenecks

Proven ability to lead through influence — setting direction across teams without direct authority

Exceptional communication skills, especially during high‑stakes incidents and cross‑functional decision‑making

Bonus Points

Experience supporting GPU‑based, AI/ML, or HPC workloads

Familiarity with bare‑metal provisioning, hardware lifecycle management, or data center operations

Experience building or scaling a NeoCloud or cloud‑adjacent platform from early growth to maturity

Background in capacity planning for GPU, storage, or high‑throughput networking environments

Passion for sustainable infrastructure or next‑generation cloud architectures

Benefits:

Industry-competitive pay

Restricted Stock Units in a fast-growing, well‑funded technology company

Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents

Employer contributions to HSA accounts

Paid Parental Leave

Paid life insurance, short‑term and long‑term disability

Teladoc

401(k) with a 100% match up to 4% of salary

Generous paid time off and holiday schedule

Cell phone reimbursement

Tuition reimbursement

Subscription to the Calm app

MetLife Legal

Company-paid commuter benefit: $300 per month

Compensation:

Compensation will be paid in the range of $261,000 to $326,000, plus bonus. Restricted Stock Units are included in all offers. Compensation is determined by the applicant’s education, experience, knowledge, skills, and abilities, as well as internal equity and alignment with market data.

Crusoe is an Equal Opportunity Employer. Employment decisions are made without regard to race, color, religion, disability, genetic information, pregnancy, citizenship, marital status, sex/gender, sexual preference/orientation, gender identity, age, veteran status, national origin, or any other status protected by law or regulation.
