Talent.com
Distributed Systems Engineer

krea.ai • San Francisco, CA, United States
Job type: Full-time

About Krea

At Krea, we are building next‑generation AI creative tools. We are dedicated to making AI intuitive and controllable for creatives. Our mission is to build tools that empower human creativity, not replace it. We believe AI is a new medium that allows us to express ourselves through various formats—text, images, video, sound, and even 3D. We’re building better, smarter, and more controllable tools to harness this medium.

This job

Robust, reliable, and scalable distributed systems form the backbone of Krea. These systems support the infrastructure that powers our AI research, real‑time user experiences, and large‑scale model deployments.

Responsibilities

  • Design, build, and maintain large‑scale distributed infrastructure to reliably support AI research and real‑time model serving.
  • Own and scale our multi‑thousand‑node Kubernetes GPU clusters, ensuring efficient and fault‑tolerant operations.
  • Collaborate closely with ML engineers and researchers to architect systems that enable rapid experimentation and deployment.
  • Improve network architecture, optimize load balancing, and streamline operational practices across multi‑zone cloud deployments.

Example Projects

  • Own and manage a large‑scale Kubernetes cluster designed to run extensive ML training and inference workloads.
  • Architect fault‑tolerant systems ensuring uninterrupted model training and real‑time inference despite individual node failures.
  • Develop and implement optimized load‑balancing strategies to efficiently distribute workloads across zones.
  • Create comprehensive monitoring, alerting systems, and operational playbooks for high‑availability clusters.
  • Migrate existing deployments to Infrastructure as Code (Terraform) for reproducibility and scalability.
  • Set up IP‑based rate limiting to prevent GPU abuse.

Strong Candidates May Have Experience With

  • Kubernetes at scale (thousands of nodes)
  • Cloud infrastructure management (AWS / GCP / Azure)
  • High‑performance and fault‑tolerant networking
  • Low‑level Linux interfaces and administration
  • Debugging complex distributed systems in production
  • Python, Golang, Ruby, Rust, and similar systems languages
  • Bonus: Infrastructure as Code (e.g., Terraform)

About Us

  • We’re building AI creative tooling.
  • We’ve raised over $83M from the best investors in Silicon Valley.
  • We’re a team of 12 serving millions of active users, and we’re scaling aggressively.