Talent.com
Data Engineer

Institute Of Foundation Models • Sunnyvale, California, United States
Posted 30 days ago
Job type: Full-time
Job description

About the Institute of Foundation Models

We are a dedicated research lab for building, understanding, using, and risk-managing foundation models. Our mandate is to advance research, nurture the next generation of AI builders, and drive transformative contributions to a knowledge-driven economy.

As part of our team, you’ll have the opportunity to work on the core of cutting-edge foundation model training, alongside world-class researchers, data scientists, and engineers, tackling the most fundamental and impactful challenges in AI development. You will participate in the development of groundbreaking AI solutions that have the potential to reshape entire industries. Strategic and innovative problem-solving skills will be instrumental in establishing MBZUAI as a global hub for high-performance computing in deep learning, driving impactful discoveries that inspire the next generation of AI pioneers.

The Role

As a Data Engineer specializing in Natural Language Processing (NLP) and large-scale data processing, you will quickly and effectively gather, curate, and prepare high-quality datasets to support cutting-edge NLP research. Your role will be instrumental in enabling researchers by delivering essential data through efficient and scalable engineering practices, including web crawling, LLM-generated content refinement, and robust data pipelines, primarily leveraging Python and related technologies.

Key Responsibilities

  • Rapidly collect, curate, and preprocess datasets based on detailed specifications provided by NLP researchers, delivering data within tight timelines (typically within 1-2 days).
  • Develop and maintain efficient web crawling solutions, APIs, and automated workflows to continuously improve data collection processes.
  • Refine and evaluate outputs from Large Language Models (LLMs) to generate structured datasets suitable for model training and benchmarking.
  • Implement scalable data pipelines, ensuring efficient data processing, storage, retrieval, and distribution to research teams.
  • Collaborate closely with researchers and engineers to ensure collected data meets specified quality and relevance criteria.
  • Document data collection methodologies, dataset characteristics, and pipeline architecture clearly and effectively.
  • Engage with peer teams and participate in technical reviews to uphold best practices and data quality standards.
  • Represent MBZUAI at industry and research forums, showcasing technical capabilities in large-scale data processing and AI data infrastructure.
  • Perform all other duties as reasonably directed by the line manager commensurate with these functional objectives.

Academic Qualifications

  • Bachelor's degree in Computer Science, Data Science, Engineering, or a related technical field required.
  • Master’s degree or equivalent experience in Computer Science, Data Engineering, or related technical fields preferred.

Professional Experience - Required

  • Extensive experience in data engineering, data processing, and automation using Python.
  • Demonstrated proficiency in designing and deploying web crawling solutions, automated data extraction, and processing pipelines.
  • Strong understanding of data structures, algorithms, databases, SQL, and performance optimization.
  • Experience working with cloud infrastructure and distributed data processing frameworks (e.g., AWS, Spark, Kafka, Kubernetes).
  • Excellent problem-solving abilities, attention to detail, and the capability to rapidly address technical challenges.
  • Strong communication and collaboration skills with cross-functional teams.

Professional Experience - Preferred

  • Proven track record of supporting NLP or AI research teams with rapid and reliable data delivery.
  • Experience with refining outputs from large-scale AI models, such as LLM-generated data.
  • Contributions to open-source projects, coding competitions, or high visibility in coding communities (e.g., GitHub, Stack Overflow).
  • Familiarity with the latest advancements in NLP data processing and large language model technologies.

$100,000 - $500,000 a year

Visa Sponsorship

This position is eligible for visa sponsorship.

Benefits Include

  • Comprehensive medical, dental, and vision benefits
  • Bonus
  • 401K Plan
  • Generous paid time off, sick leave and holidays
  • Paid Parental Leave
  • Employee Assistance Program
  • Life insurance and disability