Site Reliability Engineer

Speak • San Francisco, CA, United States

Posted 30 days ago

Job type: Full-time

Job description

About us

Our mission is to reinvent the way people learn, starting with language.

Learning a language can change a life by opening doors to new cultures, careers, and communities. Two billion people around the world are actively trying to learn a language, but the best way to learn (one‑on‑one tutoring) is hard to access at scale and hasn’t been meaningfully improved in decades. Speak is building a human‑level, AI‑powered tutor in your pocket: a conversation‑first experience that lets learners actually speak, get instant feedback, and progress through carefully designed lessons. The result is a complete path from beginner to confident speaker across multiple languages.

Speak first launched in South Korea in 2019, where it has since become the number one language learning app, and we now serve learners across many markets and 15+ languages. Speak is one of the world’s leading AI companies, with over $150M raised in venture investment from OpenAI, Accel, Founders Fund, Khosla Ventures, and more, and a distributed team across San Francisco, Seoul, Tokyo, Taipei, and Ljubljana.

About this role

As a Site Reliability Engineer at Speak, you’ll be the driving force behind the reliability and resilience of the systems that power our global language learning experience. You’ll lead efforts to scale our infrastructure, harden our platform, and ensure that our services are fast, available, and reliable for millions of users around the world.

You’ll work across our stack, from Kubernetes on GCP to our Node.js APIs, Postgres, and Redis, building robust infrastructure and operational tooling. You’ll own incident response, observability, and SLOs while embedding a culture of reliability throughout the engineering org.

Speak is growing rapidly, and we’re pushing our systems harder every day. This is a unique opportunity to shape the future of our platform as we scale to the next 10x of users.

What you’ll be doing

Own the reliability of Speak’s infrastructure across GCP, Kubernetes, and our Node.js/Postgres stack

Lead response for P0/P1 incidents, drive postmortems, and ensure we’re learning from every outage

Improve observability, alerting, and on‑call processes so we catch issues before users do

Define and drive adoption of SLOs/SLAs for core systems and services

Build tools and frameworks to make reliability easier for product engineers—think safer deploys and infrastructure automation

Collaborate cross‑functionally with Product, Engineering, and ML teams to ensure reliability is baked into everything we build

Set short-term and long-term roadmaps to ensure stability for our growing user base

Be a thought leader and coach around SRE principles—blameless culture, operational maturity, and continuous improvement

What we’re looking for

7+ years of experience in SRE, DevOps, or infrastructure‑focused engineering roles, ideally with experience leading or mentoring others

Strong experience with GCP, Kubernetes, Terraform, Node.js, Python, PostgreSQL, Redis, and observability tooling like Prometheus and Sentry

Proven track record of improving reliability, scaling systems, and reducing incident frequency and severity in high-traffic systems

Strong incident management and root cause analysis skills—you know how to lead under pressure

Experience building and maintaining CI/CD pipelines and deployment safety tooling

Strong systems thinking, with the ability to identify failure points and proactively harden services

Deep sense of ownership and a desire to make infrastructure a force multiplier for the rest of the org

Bonus

Familiarity with cost optimization strategies in cloud‑native environments

Background in security, chaos engineering, or disaster recovery planning

Contributions to internal tooling, automation, or developer productivity initiatives

Why work at Speak

Join a fantastic, tight‑knit team at the right time: we're growing very quickly, we've most recently raised our Series C from some of the top investors in the valley, and we've achieved product‑market fit in our initial markets. You'd join at a magical time when a single person could significantly change the course of the company.

Do your life's work with people you’ll love working with: we care strongly about our craft and want every person at Speak to feel like they're growing every day. We believe in the idea that working with people you both enjoy and have respect for makes everything better. We hire thoughtfully and only work with people we admire deeply.

Global in nature: We're live in over 40 countries and launching in a number of new markets soon. We have dedicated offices in San Francisco, Ljubljana, Seoul, and Tokyo, and you’ll have the opportunity to talk to users in each of these regions on a regular basis as well as travel.

Impact people's lives in a major way: Learning a language is one of the single most life‑changing skills one can learn, and right now 99% of people never achieve their goal because the process is broken. We’re helping millions of people achieve their goals and improve their lives.

Speak does not discriminate based upon race, religion, color, national origin, gender (including pregnancy, childbirth, or related medical conditions), sexual orientation, gender identity, gender expression, age, status as a protected veteran, status as an individual with a disability, or other applicable legally protected characteristics.
