Integrating AI Evaluation Tools with TestRail

AI teams today use a growing ecosystem of tools to evaluate the quality of their applications.

Whether you’re using Langfuse, LangSmith, DeepEval, OpenAI Evals, Arize Phoenix, or a custom evaluation pipeline, these tools help generate valuable quality signals about your AI system’s performance.

The challenge is that evaluation results often remain isolated within those platforms.

With TestRail’s AI Evaluation Template and APIs, teams can automatically push evaluation results into TestRail, creating a centralized location for tracking AI quality alongside the rest of their testing and release processes.

Use the tools you already have

TestRail doesn't replace your AI observability or evaluation platform. Instead, it provides a common destination for storing, reviewing, and reporting on evaluation results regardless of where they originate.

 

The AI Quality Stack

Most organizations already have tooling across multiple layers of their AI stack.

 

  • Application Layer
  • Model / Agent Layer
  • Observability Layer
  • Evaluation Layer
  • Quality Management Layer: where TestRail centralizes AI evaluation outcomes

 

Most AI evaluation platforms focus on evaluating outputs, not managing test cases. While they can score responses, benchmark prompts, and compare models, they typically don’t provide a structured way to organize evaluations into test suites, track coverage, or connect results to broader quality and release processes.

By pushing evaluation results into TestRail, teams can associate AI evaluation outcomes with structured test cases, making it easier to track coverage, compare results over time, and manage AI quality alongside the rest of their testing activities.
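As a rough illustration of that association, the sketch below maps evaluation outcomes to TestRail case IDs and groups them into a single payload shaped for the bulk `add_results_for_cases/{run_id}` endpoint. The scenario names, case IDs, and the `scenario`, `passed`, and `reasoning` keys are hypothetical; status IDs 1 (Passed) and 5 (Failed) are TestRail's defaults.

```python
from typing import Any


def build_bulk_results(
    evaluations: list[dict[str, Any]],
    case_map: dict[str, int],
) -> dict[str, list[dict[str, Any]]]:
    """Group evaluation outcomes by TestRail case ID for one bulk request."""
    results = []
    for evaluation in evaluations:
        case_id = case_map.get(evaluation["scenario"])
        if case_id is None:
            # No matching test case: skip the outcome rather than guess an ID
            continue
        results.append({
            "case_id": case_id,
            "status_id": 1 if evaluation["passed"] else 5,  # 1 = Passed, 5 = Failed
            "comment": evaluation.get("reasoning", ""),
        })
    return {"results": results}


# Hypothetical scenario names and case IDs, for illustration only
case_map = {"refund-policy-question": 101, "password-reset-flow": 102}
evaluations = [
    {"scenario": "refund-policy-question", "passed": True, "reasoning": "Accurate answer"},
    {"scenario": "password-reset-flow", "passed": False, "reasoning": "Missing a step"},
    {"scenario": "unmapped-scenario", "passed": True},
]
bulk_payload = build_bulk_results(evaluations, case_map)
```

Keeping the scenario-to-case mapping in one place means an evaluation without a matching test case is easy to spot, which is one simple way to notice coverage gaps.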

 

What Can Be Sent to TestRail?

Evaluation tools typically generate information such as:

  • Prompts and inputs
  • AI-generated outputs
  • Trace URLs
  • Human review scores
  • LLM-as-a-Judge scores
  • Accuracy metrics
  • Safety assessments
  • Latency measurements
  • Evaluation scores

Using TestRail APIs, this data can be attached directly to test results created with the AI Evaluation Template.

 

TestRail Field | Evaluation Data
Input          | User prompt
Output         | Model response
Traces         | Trace URL
Latency        | Response time
Quality Rating | Fully customizable quality categories
Comment        | Evaluation reasoning

 

The following example script uploads evaluation data to a test case that uses the AI Evaluation Template:

TestRail API Result Submission
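import os

import requests

# Connection settings are read from environment variables
TESTRAIL_URL = os.getenv("TESTRAIL_URL")
TESTRAIL_EMAIL = os.getenv("TESTRAIL_EMAIL")
TESTRAIL_API_KEY = os.getenv("TESTRAIL_API_KEY")
TESTRAIL_PROJECT_ID = os.getenv("TESTRAIL_PROJECT_ID")
# Assumed variable name: the endpoint below needs the ID of the target test run
TESTRAIL_RUN_ID = os.getenv("TESTRAIL_RUN_ID")

# case_id, status_id, trace, latency, notes, prompt, response_output,
# trace_url, and filtered_ratings are assumed to come from your evaluation tool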

# ==========================================
# TESTRAIL RESULT ENDPOINT
# ==========================================

result_endpoint = (
    f"{TESTRAIL_URL}/index.php?/api/v2/"
    f"add_result_for_case/{TESTRAIL_RUN_ID}/{case_id}"
)

# ==========================================
# TESTRAIL PAYLOAD
# ==========================================

payload = {
    "status_id": status_id,

    "comment": (
        f"Automated AI Evaluation Result\n\n"
        f"Trace Name: {trace['name']}\n"
        f"Latency: {latency} seconds\n\n"
        f"Notes:\n{notes}"
    ),

    "custom_ai_input": prompt,
    "custom_ai_output": response_output,
    "custom_ai_traces": trace_url,
    "custom_ai_latency": str(latency),
    "quality_rating": filtered_ratings
}

# ==========================================
# SEND RESULT TO TESTRAIL
# ==========================================

result = requests.post(
    result_endpoint,
    json=payload,
    auth=(TESTRAIL_EMAIL, TESTRAIL_API_KEY),
    headers={
        "Content-Type": "application/json"
    },
    timeout=30  # avoid hanging indefinitely if TestRail is unreachable
)

print(f"TestRail Status: {result.status_code}")

if result.status_code == 200:
    print("Result successfully added to TestRail")
else:
    print("ERROR sending result")
    print(result.text)
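
The script above expects `status_id` to be set already. A minimal sketch of one way to derive it from a numeric evaluation score follows, assuming scores between 0 and 1. The function name and both thresholds are hypothetical and should be tuned to your own quality bar; the IDs 1 (Passed), 4 (Retest), and 5 (Failed) are TestRail's default statuses.

```python
# Default TestRail status IDs
PASSED, RETEST, FAILED = 1, 4, 5


def status_from_score(
    score: float,
    pass_threshold: float = 0.8,    # hypothetical threshold
    review_threshold: float = 0.5,  # hypothetical threshold
) -> int:
    """Map an evaluation score in [0, 1] to a TestRail status ID."""
    if score >= pass_threshold:
        return PASSED
    if score >= review_threshold:
        # Borderline scores are flagged for another look instead of failing outright
        return RETEST
    return FAILED


status_id = status_from_score(0.92)
```

Using a middle band that maps to Retest is only one possible design; a strict pass/fail split works just as well if borderline results do not need human follow-up.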

 

Example Workflow

 

How AI Evaluation Results Flow Into TestRail

  1. AI Application
  2. Evaluation Tool (Langfuse, Promptfoo, DeepEval, etc.)
  3. Human Review or LLM Judge
  4. TestRail (AI Evaluation Results)

 


 

Why Integrate AI Evaluations with TestRail?

AI evaluation platforms are excellent at generating scores, traces, and benchmark results, but they typically aren’t designed to manage test cases or broader testing workflows.

By bringing evaluation results into TestRail, teams can connect AI quality signals with the structured testing processes they already use to manage software quality.

Benefits include:

  • Associate evaluation results with test cases to create a repeatable and organized AI testing process.
  • Track coverage across prompts, scenarios, and use cases rather than evaluating outputs in isolation.
  • Combine manual testing, automation, and AI evaluations within the same test suites and test plans.
  • Compare models, prompts, agents, and evaluation strategies using a consistent testing framework.
  • Maintain a historical record of evaluation outcomes linked to specific tests, releases, and milestones.
  • Support both human reviews and LLM-as-a-Judge workflows without creating separate reporting processes.
  • Use existing TestRail dashboards and reports to monitor quality trends and identify regressions over time.

 

Flexible by Design

Because integrations are built using TestRail APIs, teams can connect virtually any AI observability or evaluation platform.

Whether results come from Langfuse traces, Promptfoo benchmarks, DeepEval metrics, OpenAI Evals, human reviewers, or proprietary evaluation systems, the integration pattern remains the same:

  1. Execute a test scenario.
  2. Generate evaluation results.
  3. Push the results into TestRail.
  4. Track outcomes alongside your existing test cases, test suites, and release activities.
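
The steps above can be sketched as a small loop. This is an illustration under stated assumptions, not a definitive implementation: `run_evaluation_cycle`, `generate`, `evaluate`, and `push` are hypothetical names, the 0.8 threshold is arbitrary, and the `custom_ai_input` and `custom_ai_output` keys follow the example script earlier in this article. In practice, `push` would wrap the `add_result_for_case` request.

```python
from typing import Callable


def run_evaluation_cycle(
    scenarios: dict[int, str],
    generate: Callable[[str], str],
    evaluate: Callable[[str, str], float],
    push: Callable[[int, dict], None],
    pass_threshold: float = 0.8,  # hypothetical threshold
) -> None:
    """Run each scenario, score its output, and hand the result to `push`."""
    for case_id, prompt in scenarios.items():
        output = generate(prompt)         # 1. Execute a test scenario
        score = evaluate(prompt, output)  # 2. Generate evaluation results
        payload = {
            "status_id": 1 if score >= pass_threshold else 5,  # 1 = Passed, 5 = Failed
            "comment": f"Evaluation score: {score:.2f}",
            "custom_ai_input": prompt,
            "custom_ai_output": output,
        }
        push(case_id, payload)            # 3. Push the results into TestRail


# Stand-ins for the model, the evaluator, and the TestRail call
pushed = []
run_evaluation_cycle(
    scenarios={101: "What is your refund policy?"},
    generate=lambda prompt: "Refunds are available within 30 days.",
    evaluate=lambda prompt, output: 0.9,
    push=lambda case_id, payload: pushed.append((case_id, payload)),
)
```

Passing the model call, the evaluator, and the TestRail call in as functions keeps the loop independent of any particular evaluation tool. Step 4 then happens inside TestRail itself.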

The result is a more structured approach to AI quality, in which evaluation results stay connected to the test cases they relate to. That makes it easier to measure coverage, track quality over time, and manage AI testing as part of the broader software delivery process.
