Job Description

Overview

This is a contract engagement, initially 6 months, with the potential for a long-term extension.

Location: Paris-based preferred; remote within Europe considered for strong candidates.

What You’ll Do

  • Create high-quality coding prompts and reference answers (benchmark-style problems, e.g. SWE-Bench).
  • Evaluate LLM outputs for code generation, refactoring, debugging, and implementation tasks.
  • Identify and document model failures, edge cases, and reasoning gaps.
  • Perform head-to-head evaluations between private LLMs (Mistral-based) and leading external models.
  • Build or configure coding environments to support evaluation and reinforcement learning (RL).
  • Follow detailed annotation and evaluation guidelines with high consistency.

What We’re Looking For

  • 10+ years of professional software development experience.
  • Strong Python skills (required).
