Reliability Engineer

Etched • Cupertino, CA, US

Job type: Full-time

Job Description

About Etched

Etched is building AI chips that are hard-coded for individual model architectures. Our first product (Sohu) only supports transformers, but has an order of magnitude more throughput and lower latency than a B200. With Etched ASICs, you can build products that would be impossible with GPUs, like real-time video generation models and extremely deep chain-of-thought reasoning.

Reliability Engineer

We are seeking a skilled and detail-oriented Reliability Engineer to join our team. As a Reliability Engineer at Etched, you will play a critical role in ensuring that all components and systems meet our rigorous reliability standards, essential for our datacenter applications. This position requires a deep understanding of reliability engineering principles, as well as experience working with suppliers, ODMs, and JDMs.

Representative Projects:

Lead the development, implementation, and management of reliability standards for all suppliers working with Etched. Ensure that all components and systems meet or exceed the required reliability benchmarks.
Review and verify reliability reports from suppliers, ensuring accuracy and adherence to Etched's standards. Provide guidance and feedback to suppliers to ensure continuous improvement in reliability performance.
Collaborate with cross-functional teams to review and recommend component selection criteria based on reliability performance. Ensure that all selected components are capable of meeting the long-term reliability requirements of our datacenter applications.
Evaluate and approve reliability test plans proposed by external vendors. Ensure that the test methodologies and conditions are sufficient to validate long-term reliability under expected operating conditions.
Conduct in-depth analysis of reliability data provided by suppliers and vendors. Identify trends, potential issues, and areas for improvement to enhance overall reliability.
Work closely with ODMs (Original Design Manufacturers) and JDMs (Joint Design Manufacturers) to ensure that all products meet Etched's quality and reliability standards. Provide technical guidance and support to maintain maximum operational uptime and long-term reliability.
Review and establish reliability metrics and standards for silicon components, ensuring they meet the stringent requirements for long-term reliability in datacenter environments.

You may be a good fit if you have:

Bachelor's or Master's degree in Reliability Engineering, Electrical Engineering, or a related field.

5+ years of experience in reliability engineering, with a focus on datacenter applications preferred.

Strong understanding of reliability standards, testing methodologies, and data analysis techniques. Experience with DFMEA, PFMEA, and SPC engineering analysis is desired.

Experience working with suppliers, ODMs, and JDMs in a high-tech environment.

Excellent communication skills, with the ability to convey complex technical concepts to diverse stakeholders.

Proven ability to manage multiple projects and deliver results in a fast-paced environment.

We encourage you to apply even if you do not believe you meet every single qualification.

How we're different:

Etched believes in the Bitter Lesson. We think most of the progress in the AI field has come from using more FLOPs to train and run models, and the best way to get more FLOPs is to build model-specific hardware. Larger and larger training runs encourage companies to consolidate around fewer model architectures, which creates a market for single-model ASICs.

We are a fully in-person team in Cupertino, and greatly value engineering skills. We do not have boundaries between engineering and research, and we expect all of our technical staff to contribute to both as needed.

Benefits:

Full medical, dental, and vision packages, with 100% of premiums covered (90% for dependents)

Housing subsidy of $2,000/month for those living within walking distance of the office

Daily lunch and dinner in our office

Relocation support for those moving to Cupertino
