Talent.com
Data Engineer
Institute of Foundation Models • Sunnyvale, CA, US
Posted 30 days ago
Job type: Full-time

Job Description

About the Institute of Foundation Models
We are a dedicated research lab for building, understanding, using, and risk-managing foundation models. Our mandate is to advance research, nurture the next generation of AI builders, and drive transformative contributions to a knowledge-driven economy.

As part of our team, you’ll have the opportunity to work on the core of cutting-edge foundation model training, alongside world-class researchers, data scientists, and engineers, tackling the most fundamental and impactful challenges in AI development. You will participate in the development of groundbreaking AI solutions that have the potential to reshape entire industries. Strategic and innovative problem-solving skills will be instrumental in establishing MBZUAI as a global hub for high-performance computing in deep learning, driving impactful discoveries that inspire the next generation of AI pioneers.



The Role

As a Data Engineer specializing in Natural Language Processing (NLP) and large-scale data processing, you will quickly and effectively gather, curate, and prepare high-quality datasets to support cutting-edge NLP research. Your role will be instrumental in enabling researchers by delivering essential data through efficient and scalable engineering practices, including web crawling, LLM-generated content refinement, and robust data pipelines, primarily leveraging Python and related technologies.
Key Responsibilities
  • Rapidly collect, curate, and preprocess datasets based on detailed specifications provided by NLP researchers, delivering data within tight timelines.
  • Develop and maintain efficient web crawling solutions, APIs, and automated workflows to continuously improve data collection processes.
  • Refine and evaluate outputs from Large Language Models (LLMs) to generate structured datasets suitable for model training and benchmarking.
  • Implement scalable data pipelines, ensuring efficient data processing, storage, retrieval, and distribution to research teams.
  • Collaborate closely with researchers and engineers to ensure collected data meets specified quality and relevance criteria.
  • Document data collection methodologies, dataset characteristics, and pipeline architecture clearly and effectively.
  • Engage with peer teams and participate in technical reviews to uphold best practices and data quality standards.
  • Represent MBZUAI at industry and research forums, showcasing technical capabilities in large-scale data processing and AI data infrastructure.
Academic Qualifications
  • Bachelor's degree in Computer Science, Data Science, Engineering, or a related technical field required.
  • Master's degree, PhD, or equivalent experience in Computer Science, Data Engineering, or a related technical field preferred.
Professional Experience - Required
  • Extensive experience in data engineering, data processing, and automation using Python.
  • Demonstrated proficiency in designing and deploying web crawling solutions, automated data extraction, and processing pipelines.
  • Strong understanding of data structures, algorithms, databases, SQL, and performance optimization.
  • Experience working with cloud infrastructure and distributed data processing frameworks (e.g., AWS, Spark, Kafka, Kubernetes).
  • Excellent problem-solving abilities, attention to detail, and the capability to rapidly address technical challenges.
  • Strong communication and collaboration skills with cross-functional teams.
Professional Experience - Preferred
  • Proven track record of supporting NLP or AI research teams with rapid and reliable data delivery.
  • Experience working with large language models, including evaluation, efficient inference, and prompt engineering.
  • Experience with refining outputs from large-scale AI models, such as LLM-generated data.
  • Contributions to open-source projects, coding competitions, or high visibility in coding communities (e.g., GitHub, Stack Overflow).
  • Familiarity with the latest advancements in NLP data processing and large language model technologies.
Visa Sponsorship
This position is eligible for visa sponsorship.

Benefits Include
  • Comprehensive medical, dental, and vision benefits
  • Bonus
  • 401(k) plan
  • Generous paid time off, sick leave, and holidays
  • Paid parental leave
  • Employee Assistance Program
  • Life insurance and disability

