Job Description

Functional AI Tester - GenAI


About the Role

You will be involved in QA for GenAI features, including Retrieval-Augmented Generation (RAG), conversational AI, and agentic evaluations. The role centers on:

  • Systematic GenAI evaluation (qualitative and quantitative metrics)

  • ETL and data quality testing for the data flows that feed AI systems

  • Python-driven automated testing

This position is hands-on and collaborative, partnering with AI engineers, data engineers, and product teams to define measurable acceptance criteria and ship high-quality AI features.

Key Responsibilities

  • Test strategy and planning

    Define risk-based test strategies and detailed test plans for GenAI features.

    Establish clear acceptance criteria with stakeholders for functional, safety, and data quality aspects.

  • Python test automation

    Build and maintain automated test suites using Python (e.g., PyTest, requests).

    Implement reusable utilities for prompt/response validation, dataset management, and result scoring.

    Create regression baselines and golden test sets to detect quality drift.
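
    As an illustrative sketch only (the scoring function, toy golden set, and threshold are hypothetical, not part of any existing suite), a golden-set regression check might look like:

```python
# Minimal golden-set regression sketch (illustrative names and data).
# A "golden set" pairs prompts with reference answers; a scoring function
# compares the model's output against the reference, and a threshold
# flags quality drift between releases.

def token_overlap_score(candidate: str, reference: str) -> float:
    """Crude similarity: fraction of reference tokens present in the candidate."""
    ref_tokens = set(reference.lower().split())
    cand_tokens = set(candidate.lower().split())
    if not ref_tokens:
        return 0.0
    return len(ref_tokens & cand_tokens) / len(ref_tokens)

# Toy golden set; a real one would live in version-controlled data files.
GOLDEN_SET = [
    {"prompt": "What is the capital of France?",
     "reference": "Paris is the capital of France."},
]

def check_against_golden_set(generate, threshold: float = 0.5):
    """Run every golden prompt through `generate`; return cases below threshold."""
    failures = []
    for case in GOLDEN_SET:
        answer = generate(case["prompt"])
        score = token_overlap_score(answer, case["reference"])
        if score < threshold:
            failures.append((case["prompt"], score))
    return failures
```

    In a PyTest suite, each golden case would typically become a parametrized test via `@pytest.mark.parametrize`, so drift shows up as individual test failures.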

  • GenAI evaluation

    Develop evaluation harnesses covering factuality, coherence, helpfulness, safety, bias, and toxicity.

    Design prompt suites, scenario-based tests, and golden datasets for reproducible measurements.

    Implement guardrail tests including prompt-injection resilience, unsafe content detection, and PII redaction checks.

    Track quality metrics over time.
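
    One guardrail test from this family, sketched with deliberately simple patterns (a real system would use a dedicated PII detector; the function names here are illustrative):

```python
import re

# Illustrative PII-redaction guardrail check: verify a model response
# contains no obvious PII patterns (emails, US-style SSNs). The regexes
# are intentionally simplistic, for demonstration only.

PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def find_pii(text: str) -> dict:
    """Return a mapping of PII type -> matches found in the text."""
    hits = {}
    for name, pattern in PII_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            hits[name] = matches
    return hits

def assert_redacted(response: str) -> None:
    """Fail loudly if any PII pattern survives redaction."""
    hits = find_pii(response)
    assert not hits, f"PII leaked in response: {hits}"
```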

  • RAG and semantic retrieval testing

    Verify alignment between retrieved sources and generated answers.

    Design and run adversarial retrieval tests (e.g., misleading, ambiguous, or out-of-scope queries).

    Measure retrieval relevance, precision/recall, grounding quality, and hallucination reduction.
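
    Precision and recall at k are standard retrieval metrics; a minimal sketch (document IDs and relevance judgments are toy data):

```python
# Illustrative retrieval-relevance metrics for a single query:
# precision@k = relevant hits in top k / k retrieved
# recall@k    = relevant hits in top k / total relevant documents

def precision_at_k(retrieved: list, relevant: set, k: int) -> float:
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc in top_k if doc in relevant) / len(top_k)

def recall_at_k(retrieved: list, relevant: set, k: int) -> float:
    if not relevant:
        return 0.0
    top_k = retrieved[:k]
    return sum(1 for doc in top_k if doc in relevant) / len(relevant)
```

    In practice these would be averaged across a query set, with relevance judgments maintained alongside the golden datasets.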

  • API and application testing

    Test REST endpoints supporting GenAI features (request/response contracts, error handling, timeouts).
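
    A contract check might be sketched as follows, using only the standard library for clarity (in practice pydantic or jsonschema would do this more robustly; the field names below are illustrative, not a real API contract):

```python
# Minimal response-contract check. Each expected field maps to the
# type (or tuple of types) its value must have.

EXPECTED_CONTRACT = {
    "answer": str,
    "sources": list,
    "latency_ms": (int, float),
}

def contract_violations(payload: dict) -> list:
    """Return a list of human-readable contract violations (empty = pass)."""
    problems = []
    for field, expected_type in EXPECTED_CONTRACT.items():
        if field not in payload:
            problems.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected_type):
            problems.append(f"wrong type for {field}: {type(payload[field]).__name__}")
    return problems
```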

  • ETL and data quality validation

    Test ingestion and transformation logic; validate schema, constraints, and field-level rules.

    Implement data profiling, reconciliation between sources and targets, and lineage checks.

    Verify data privacy controls, masking, and retention policies across pipelines.
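
    A source-to-target reconciliation check might be sketched like this (the key, columns, and table contents are toy data; real pipelines would run this against query results):

```python
import hashlib

# Illustrative reconciliation: compare row presence and a per-key
# checksum of selected columns between source and target tables.

def row_checksum(row: dict, columns: list) -> str:
    """Stable checksum of the selected columns of one row."""
    joined = "|".join(str(row.get(c, "")) for c in columns)
    return hashlib.sha256(joined.encode()).hexdigest()

def reconcile(source_rows, target_rows, key: str, columns: list) -> dict:
    """Report keys missing, unexpected, or with mismatched column values."""
    src = {r[key]: row_checksum(r, columns) for r in source_rows}
    tgt = {r[key]: row_checksum(r, columns) for r in target_rows}
    return {
        "missing_in_target": sorted(set(src) - set(tgt)),
        "unexpected_in_target": sorted(set(tgt) - set(src)),
        "mismatched": sorted(k for k in src.keys() & tgt.keys() if src[k] != tgt[k]),
    }
```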

  • Non-functional testing

    Performance and load testing focused on latency, throughput, concurrency, and rate limits for LLM calls.

    Cost-aware testing (token usage, caching effectiveness) and timeout/retry behavior validation.

    Reliability and resilience checks including error recovery and fallback behavior.
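
    Retry behavior is easiest to validate against a simulated flaky dependency; a sketch (the wrapper and its signature are hypothetical, not a real client API):

```python
# Illustrative retry check: a wrapper that retries a failing call a
# bounded number of times and reports the attempt count, so tests can
# assert both the recovery path and how many retries occurred.

def call_with_retries(fn, max_attempts: int = 3):
    """Invoke fn(); retry on exception up to max_attempts, then re-raise.

    Returns (result, attempts_used) on success.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(), attempt
        except Exception as exc:
            last_error = exc
    raise last_error
```

    A test would inject a callable that fails a fixed number of times before succeeding, then assert on the attempt count and on the exception raised when the budget is exhausted.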

  • Share results and insights; recommend remediation and preventive actions.

Required Qualifications

  • Experience

    5+ years in software QA, including test strategy, automation, and defect management.

    2+ years testing AI/ML or GenAI features, with hands-on evaluation design.

    4+ years testing ETL/data pipelines and data quality.

  • Technical skills

    Python: strong proficiency building automated tests and tooling (PyTest, requests, pydantic, or similar).

    API testing: REST contract testing, schema validation, negative testing.

    GenAI evaluation: crafting prompt suites, golden datasets, rubric-based scoring, and automated evaluation pipelines.

    RAG testing: retrieval relevance, grounding validation, chunking/indexing verification, and embedding checks.

    ETL/data quality: schema and constraint validation, reconciliation, lineage awareness, data profiling.

  • Quality and governance

    Understanding of LLM limitations and methods to detect and reduce hallucinations.

    Safety and compliance testing including PII handling and prompt-injection resilience.

    Strong analytical and debugging skills across services and data flows.

  • Soft skills

    Excellent written and verbal communication; ability to translate quality goals into measurable criteria.

    Collaboration with AI engineers, data engineers, and product stakeholders.

    Organized, detail-oriented, and outcomes-focused.

Nice to Have

  • Experience with evaluation frameworks or tooling for LLMs and RAG quality measurement.

  • Experience creating synthetic datasets to stress specific behaviors.
