Talent.com
Distributed Systems Engineer

krea.ai • San Francisco, California, United States
Job type: Full-time

About Krea

At Krea, we are building next‑generation AI creative tools. We are dedicated to making AI intuitive and controllable for creatives. Our mission is to build tools that empower human creativity, not replace it. We believe AI is a new medium that allows us to express ourselves through various formats—text, images, video, sound, and even 3D. We’re building better, smarter, and more controllable tools to harness this medium.

This job

Robust, reliable, and scalable distributed systems form the backbone of Krea. These systems support the infrastructure that powers our AI research, real‑time user experiences, and large‑scale model deployments. As a Distributed Systems Engineer, you will build and operate this infrastructure in close collaboration with our ML engineers and researchers.

Responsibilities

Design, build, and maintain large‑scale distributed infrastructure to reliably support AI research and real‑time model serving.

Own and scale our multi‑thousand‑node Kubernetes GPU clusters, ensuring efficient and fault‑tolerant operations.

Collaborate closely with ML engineers and researchers to architect systems that enable rapid experimentation and deployment.

Improve network architecture, optimize load balancing, and streamline operational practices across multi‑zone cloud deployments.

Example Projects

Own and manage a large‑scale Kubernetes cluster designed to run extensive ML training and inference workloads.

Architect fault‑tolerant systems ensuring uninterrupted model training and real‑time inference despite individual node failures.

Develop and implement optimized load‑balancing strategies to efficiently distribute workloads across zones.

Create comprehensive monitoring, alerting systems, and operational playbooks for high‑availability clusters.

Migrate existing deployments to Infrastructure as Code (Terraform) for reproducibility and scalability.

Set up IP‑based rate‑limiting to prevent GPU abuse.

Strong Candidates May Have Experience With

Kubernetes at scale (thousands of nodes)

Cloud infrastructure management (AWS / GCP / Azure)

High‑performance and fault‑tolerant networking

Low‑level Linux interfaces and administration

Debugging complex distributed systems in production

Python, Golang, Ruby, Rust, and similar systems languages

Bonus: Infrastructure as Code (e.g., Terraform)

About Us

We’re building AI creative tooling.

We’ve raised over $83M from the best investors in Silicon Valley.

We’re a team of 12 serving millions of active users, and we’re scaling aggressively.
