Job Description

Staff/Senior Data Engineer: AI Training Data (2-4 Months Contract)


Location: Remote

Role Type: Contract (2-4 Months)

Time Commitment: 40 hrs/week (Full-time availability required)

Compensation: Hyper-competitive hourly rate (matching Tier-1 Staff engineering bands)

Experience: 6-12+ years


About BespokeLabs

BespokeLabs is a premier, VC-backed AI Research lab with an exceptionally talent-dense team of IIT and Ivy League alumni. We don’t just build tooling around AI—we build the massive-scale data systems and reasoning architectures that directly power next-generation models. Our research shapes the frontier of AI: we’ve published breakthroughs like GEPA, driven foundational datasets like OpenThoughts, and shipped state-of-the-art models including Bespoke-MiniCheck and Bespoke-MiniChart. More on our website :)


Role Overview

We are looking for a top-tier Senior/Staff Data Engineer for a high-impact, 2-4 month sprint. You will leverage your deep expertise in enterprise-grade data platforms to architect and build the complex curation systems required for advanced AI model training.

This is not a traditional ETL pipeline role. We need a heavy-hitter who has already operated production data platforms at scale inside large, complex organizations (FAANG, Fortune 100). You will use the mental models, architectural intuition, and coding skills you've developed over your career to generate, transform, and evaluate the data that trains the next generation of AI.


What You Will Do (The Contract)

  • Architect AI-Scale Systems: Design the overarching data architecture and processing topology needed to programmatically curate and shape datasets at TB/PB scale, ensuring low latency and high consistency.
  • Hands-On Development: Write production-grade code (Python/Scala, Spark) to build custom ingestion logic, highly efficient transformation scripts, and performant data validation checks.
  • Complex Data Logic: Implement advanced filtering, deduplication, and quality-scoring algorithms at scale, ensuring the resulting data objects are optimized for LLM/ML consumption.
  • Quality & Performance Tuning: Rigorously test, benchmark, and optimize processing workloads (CPU/memory tuning, partitioning strategies in Spark/Iceberg) to meet aggressive throughput targets.
  • Domain Subject Matter Expert: Act as the ultimate technical authority on distributed systems, data processing, and cloud infrastructure, ensuring the training data factory delivers enterprise-grade accuracy.


What You Bring to the Table (Your Past Experience)

To be successful in this contract, you must have a track record of:

  • End-to-End Ownership: Designing and owning enterprise data platforms (batch + streaming).
  • High-Throughput Processing: Building and operating Kafka-first streaming pipelines.
  • Lakehouse Architecture: Utilizing Apache Iceberg, Delta Lake, or Hudi for analytics and ML at scale.
  • Reliability Engineering: Ensuring data reliability through SLAs, monitoring, backfills, and recovery.
  • Scale: Processing billions of events and managing TB–PB scale data systems.


Required Qualifications (Non-Negotiable)

  • Experience: 6+ years of Data Engineering experience.
  • Seniority: Demonstrated Senior/Staff-level ownership of production data platforms.
  • Pedigree: Background at Tier-1 enterprises (FAANG, large SaaS, Fortune 100).
  • Technical Stack: Deep fluency in Python/Scala, Spark, Kafka, Airflow, and major cloud data warehouses (Snowflake, BigQuery, Redshift).

Apply for this Position

Ready to join? Click the button below to submit your application.

Submit Application