About Mango, Inc.
Mango is a new type of microscope for rapid bioburden testing.
Description
We are seeking a Senior Site Reliability Engineer to own and evolve the infrastructure that supports our on-premise instruments, data systems, and machine learning pipelines. This role combines systems-level engineering with software craftsmanship, requiring deep understanding of how compute, storage, and networking layers interact under real workloads.
You will be the go-to expert for diagnosing performance issues in our on-prem systems, from kernel-level I/O bottlenecks to distributed service latency. You will also build robust automation that keeps our systems consistent and observable.
Key Responsibilities
Infrastructure Design & Reliability
Design, deploy, and maintain our on-premise and hybrid infrastructure, which includes Dell PowerEdge and PowerVault servers, prosumer NAS units, and high-throughput data processing clusters. Implement fault-tolerant systems with reproducible deployments and clear observability.
Performance & Systems Analysis
Investigate complex performance issues across hardware, OS, and software boundaries. You will use Linux tooling in addition to in-house application-level metrics to uncover root causes in filesystems, caching layers, or I/O scheduling.
Automation & Tooling
Build automation for system provisioning, configuration management, and software deployment using Python, Go, Ansible, or similar frameworks. Develop lightweight services and tools that make reliability visible and maintainable.
Collaboration
Work closely with our software and hardware teams to co-design systems that meet the needs of high-resolution imaging and ML inference workloads. Translate hardware realities into software reliability guarantees.
Observability & Incident Response
Develop and maintain monitoring, alerting, and logging systems to ensure early detection of issues. Lead incident response and post-mortem efforts with a focus on learning and prevention.
Documentation & Communication
Produce clear documentation and communicate findings effectively to the broader team - from network topology diagrams to kernel tuning rationales.
General Qualifications
Bonus Qualities (Not Required)
Salary
$150,000 - $175,000 per year
Senior Site Reliability Engineer • Los Angeles, CA, United States