Job Description
Overview
This is a contract engagement, initially 6 months, with the potential for a longer-term arrangement.
Location: Paris-based preferred; remote within Europe considered for strong candidates.
What You’ll Do
- Create high-quality coding prompts and reference answers in a benchmark style (e.g., SWE-Bench-like problems).
- Evaluate LLM outputs for code generation, refactoring, debugging, and implementation tasks.
- Identify and document model failures, edge cases, and reasoning gaps.
- Run head-to-head evaluations between private Mistral-based LLMs and leading external models.
- Build or configure coding environments to support evaluation and reinforcement learning (RL).
- Follow detailed annotation and evaluation guidelines with high consistency.
What We’re Looking For
- 10+ years of professional software development experience.
- Strong Python skills (required).