Lead Test Engineer - AI Servers & Storage (FW, NVMe, SATA, SSDs, HDDs, DIMMs)
Full-time position. Must work onsite 4 days a week in Richardson, TX.
CONFIDENTIAL: Publicly traded computer hardware infrastructure platform solutions company with over $3 billion in sales, whose stock price has grown over 200% in the last year because its products and services are used within AI Data Centers.
Must have great interpersonal skills and experience TESTING enterprise-level Storage & Server infrastructure systems for AI Data Centers.
The Senior Lead Test (& Validation) Engineer - Storage & Server Infrastructure Systems will play a pivotal role in the design, development, and execution of comprehensive test strategies for AI Data Center storage and server infrastructure (HW + FW + SW).
This leadership position requires deep expertise in enterprise storage systems, server architectures, networking, and a strong understanding of the unique performance and reliability demands of AI/ML workloads. The ideal candidate will be a hands-on technical leader.
Responsibilities:
- Define, develop, and implement comprehensive test plans and strategies for all storage and server hardware, firmware, and software components within the AI Data Center environment.
- Lead the Test team in designing, executing, and analyzing complex test cases, including functional, performance, reliability, stress, and endurance testing.
- Design and implement automated test frameworks and scripts using languages like Python, Go, or similar, to improve test efficiency and coverage.
- Conduct in-depth performance analysis and bottleneck identification for storage systems (e.g., NVMe, SSD, HDD arrays, distributed storage, SAN/NAS), server platforms (e.g., CPU, GPU, memory, PCIe, networking), and OpenBMC interfaces and features.
- Debug issues related to BMC functionality and its interaction with server hardware.
- Develop and maintain robust testbeds and infrastructure for continuous integration and validation.
- Utilize open-source and commercial test tools relevant to storage, server, and OpenBMC validation.
- Collaborate closely with hardware design, software development, infrastructure, and AI/ML engineering teams to understand requirements and integrate testing throughout the product lifecycle.
- Communicate test progress, results, and critical issues effectively to stakeholders, including executive leadership.
- Develop specialized test methodologies to validate performance and reliability under heavy AI/ML workloads (e.g., large model training, inference at scale, data ingestion).
- Understand and test the interactions between GPU-accelerated computing, high-speed networking, and storage systems.
REQUIREMENTS
- BS with 8+ years of hands-on hardware VALIDATION and platform TEST engineering experience with direct exposure to AI Data Center Server & Storage components including NVMe, SATA, SSDs, HDDs, DIMMs, and system-level platforms used in large-scale cloud environments.
- Need someone who is firmly rooted in HARDWARE and FIRMWARE Validation.
- Must have 2+ years of experience in a LEAD or senior technical role, leading test initiatives and assigning and guiding junior test engineers.
- Must be very Hands-On with NVMe, SATA, SSDs, HDDs, DIMMs.
- Great interpersonal skills & English Communication skills, with the ability to collaborate effectively across diverse teams and with vendors and customers.
- Strong in Debugging server Hardware (BMC, PCIe, networking).
- Strong in AI/ML workload optimization (TensorFlow, PyTorch) and their infrastructure requirements.
- Strong Linux and Python/Go Automation, and Strong Perf analysis of storage/server platforms.
- Familiarity with OCP (Open Compute Project).
- Certifications in relevant technologies (e.g., NetApp, Dell EMC, HPE, NVIDIA).
- Distributed Storage validation.
- Contribute to platform Firmware validation testing and BIOS bring-up.
- Must work onsite 4 days a week in Richardson, TX.