Reliability Engineer

Etched • Cupertino, CA, US

Job type: Full-time

Job Description

About Etched

Etched is building AI chips that are hard-coded for individual model architectures. Our first product (Sohu) only supports transformers, but has an order of magnitude more throughput and lower latency than a B200. With Etched ASICs, you can build products that would be impossible with GPUs, like real-time video generation models and extremely deep chain-of-thought reasoning.

Reliability Engineer

We are seeking a skilled and detail-oriented Reliability Engineer to join our team. As a Reliability Engineer at Etched, you will play a critical role in ensuring that all components and systems meet our rigorous reliability standards, essential for our datacenter applications. This position requires a deep understanding of reliability engineering principles, as well as experience working with suppliers, ODMs, and JDMs.

Representative Projects:

Lead the development, implementation, and management of reliability standards for all suppliers working with Etched. Ensure that all components and systems meet or exceed the required reliability benchmarks.
Review and verify reliability reports from suppliers, ensuring accuracy and adherence to Etched's standards. Provide guidance and feedback to suppliers to ensure continuous improvement in reliability performance.
Collaborate with cross-functional teams to review and recommend component selection criteria based on reliability performance. Ensure that all selected components are capable of meeting the long-term reliability requirements of our datacenter applications.
Evaluate and approve reliability test plans proposed by external vendors. Ensure that the test methodologies and conditions are sufficient to validate long-term reliability under expected operating conditions.
Conduct in-depth analysis of reliability data provided by suppliers and vendors. Identify trends, potential issues, and areas for improvement to enhance overall reliability.
Work closely with ODMs (Original Design Manufacturers) and JDMs (Joint Design Manufacturers) to ensure that all products meet Etched's quality and reliability standards. Provide technical guidance and support to maintain maximum operational uptime and long-term reliability.
Review and establish reliability metrics and standards for silicon components, ensuring they meet the stringent requirements for long-term reliability in datacenter environments.

You may be a good fit if you have:

Bachelor's or Master's degree in Reliability Engineering, Electrical Engineering, or a related field.

5+ years of experience in reliability engineering, with a focus on datacenter applications preferred.

Strong understanding of reliability standards, testing methodologies, and data analysis techniques. Experience with DFMEA, PFMEA, and SPC engineering analysis is desired.

Experience working with suppliers, ODMs, and JDMs in a high-tech environment.

Excellent communication skills, with the ability to convey complex technical concepts to diverse stakeholders.

Proven ability to manage multiple projects and deliver results in a fast-paced environment.

We encourage you to apply even if you do not believe you meet every single qualification.

How we're different:

Etched believes in the Bitter Lesson. We think most of the progress in the AI field has come from using more FLOPs to train and run models, and the best way to get more FLOPs is to build model-specific hardware. Larger and larger training runs encourage companies to consolidate around fewer model architectures, which creates a market for single-model ASICs.

We are a fully in-person team in Cupertino, and greatly value engineering skills. We do not have boundaries between engineering and research, and we expect all of our technical staff to contribute to both as needed.

Benefits:

Full medical, dental, and vision packages, with 100% of the premium covered (90% for dependents)

Housing subsidy of $2,000/month for those living within walking distance of the office

Daily lunch and dinner in our office

Relocation support for those moving to Cupertino
