Integrating AI Evaluation Tools with TestRail

Updated June 02, 2026 17:07

AI teams today use a growing ecosystem of tools to evaluate the quality of their applications.

Whether you’re using Langfuse, LangSmith, DeepEval, OpenAI Evals, Arize Phoenix, or a custom evaluation pipeline, these tools help generate valuable quality signals about your AI system’s performance.

The challenge is that evaluation results often remain isolated within those platforms.

With TestRail’s AI Evaluation Template and APIs, teams can automatically push evaluation results into TestRail, creating a centralized location for tracking AI quality alongside the rest of their testing and release processes.

Use the tools you already have

TestRail doesn't replace your AI observability or evaluation platform. Instead, it provides a common destination for storing, reviewing, and reporting on evaluation results regardless of where they originate.

The AI Quality Stack

Most organizations already have tooling across multiple layers of their AI stack.

Application Layer

↓

Model / Agent Layer

↓

Observability Layer

↓

Evaluation Layer

↓

Quality Management Layer (where TestRail centralizes AI evaluation outcomes)

Most AI evaluation platforms focus on evaluating outputs, not managing test cases. While they can score responses, benchmark prompts, and compare models, they typically don’t provide a structured way to organize evaluations into test suites, track coverage, or connect results to broader quality and release processes.

By pushing evaluation results into TestRail, teams can associate AI evaluation outcomes with structured test cases, making it easier to track coverage, compare results over time, and manage AI quality alongside the rest of their testing activities.

What Can Be Sent to TestRail?

Evaluation tools typically generate information such as:

Prompts and inputs
AI-generated outputs
Trace URLs
Human review scores
LLM-as-a-Judge scores
Accuracy metrics
Safety assessments
Latency measurements
Evaluation scores

Using TestRail APIs, this data can be attached directly to test results created with the AI Evaluation Template.

TestRail Field	Evaluation Data
Input	User prompt
Output	Model response
Traces	Trace URL
Latency	Response time
Quality Rating	Fully customizable quality categories
Comment	Evaluation reasoning

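The following excerpt from a Python script uploads evaluation data to a test case that uses the AI Evaluation Template. Values such as case_id, status_id, prompt, response_output, trace, trace_url, latency, notes, and filtered_ratings are expected to come from your own evaluation pipeline, and the custom field names must match the system names configured in your TestRail instance.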

TestRail API Result Submission

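import os

import requests

# Connection settings are read from environment variables.
TESTRAIL_URL = os.getenv("TESTRAIL_URL")
TESTRAIL_EMAIL = os.getenv("TESTRAIL_EMAIL")
TESTRAIL_API_KEY = os.getenv("TESTRAIL_API_KEY")
TESTRAIL_PROJECT_ID = os.getenv("TESTRAIL_PROJECT_ID")
# The run ID is used in the endpoint below; reading it from the
# environment like the other settings is an assumption.
TESTRAIL_RUN_ID = os.getenv("TESTRAIL_RUN_ID")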

# ==========================================
# TESTRAIL RESULT ENDPOINT
# ==========================================

result_endpoint = (
    f"{TESTRAIL_URL}/index.php?/api/v2/"
    f"add_result_for_case/{TESTRAIL_RUN_ID}/{case_id}"
)

# ==========================================
# TESTRAIL PAYLOAD
# ==========================================

payload = {
    "status_id": status_id,

    "comment": (
        f"Automated AI Evaluation Result\n\n"
        f"Trace Name: {trace['name']}\n"
        f"Latency: {latency} seconds\n\n"
        f"Notes:\n{notes}"
    ),

    "custom_ai_input": prompt,
    "custom_ai_output": response_output,
    "custom_ai_traces": trace_url,
    "custom_ai_latency": str(latency),
    "quality_rating": filtered_ratings
}

# ==========================================
# SEND RESULT TO TESTRAIL
# ==========================================

result = requests.post(
    result_endpoint,
    json=payload,
    auth=(TESTRAIL_EMAIL, TESTRAIL_API_KEY),
    headers={
        "Content-Type": "application/json"
    },
    timeout=30,  # avoid hanging indefinitely if TestRail is unreachable
)

print(f"TestRail Status: {result.status_code}")

if result.status_code == 200:
    print("Result successfully added to TestRail")
else:
    print("ERROR sending result")
    print(result.text)
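
The script above expects status_id to be set before the payload is built. As a minimal sketch of one way to derive it, the helper below maps a numeric evaluation score to a TestRail status. The function name score_to_status_id, the 0 to 1 score scale, and the 0.7 threshold are assumptions made for illustration. The IDs 1 (Passed) and 5 (Failed) are the default TestRail status IDs, so adjust them if your instance uses custom statuses.

```python
# Default TestRail status IDs; custom statuses in your instance may differ.
STATUS_PASSED = 1
STATUS_FAILED = 5


def score_to_status_id(score: float, threshold: float = 0.7) -> int:
    """Map an evaluation score in [0, 1] to a TestRail status ID.

    Hypothetical helper: the score scale and the default threshold are
    assumptions, not values defined by TestRail or any evaluation tool.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0 and 1, got {score}")
    return STATUS_PASSED if score >= threshold else STATUS_FAILED


if __name__ == "__main__":
    # Example: a high judge score passes, a low one fails.
    print(score_to_status_id(0.85))
    print(score_to_status_id(0.40))
```

A threshold-based mapping keeps the pass/fail decision in one place, which makes it easier to tighten or relax the quality bar without touching the upload logic.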

Example Workflow

How AI Evaluation Results Flow Into TestRail

AI Application → Evaluation Tool (Langfuse, Promptfoo, DeepEval, etc.) → Human Review or LLM Judge → TestRail (AI Evaluation Results)

Why Integrate AI Evaluations with TestRail?

AI evaluation platforms are excellent at generating scores, traces, and benchmark results, but they typically aren’t designed to manage test cases or broader testing workflows.

By bringing evaluation results into TestRail, teams can connect AI quality signals with the structured testing processes they already use to manage software quality.

Benefits include:

Associate evaluation results with test cases to create a repeatable and organized AI testing process.
Track coverage across prompts, scenarios, and use cases rather than evaluating outputs in isolation.
Combine manual testing, automation, and AI evaluations within the same test suites and test plans.
Compare models, prompts, agents, and evaluation strategies using a consistent testing framework.
Maintain a historical record of evaluation outcomes linked to specific tests, releases, and milestones.
Support both human reviews and LLM-as-a-Judge workflows without creating separate reporting processes.
Use existing TestRail dashboards and reports to monitor quality trends and identify regressions over time.

Flexible by Design

Because integrations are built using TestRail APIs, teams can connect virtually any AI observability or evaluation platform.

Whether results come from Langfuse traces, Promptfoo benchmarks, DeepEval metrics, OpenAI Evals, human reviewers, or proprietary evaluation systems, the integration pattern remains the same:

Execute a test scenario.
Generate evaluation results.
Push the results into TestRail.
Track outcomes alongside your existing test cases, test suites, and release activities.
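
The steps above can be sketched as a small loop. This is an illustrative outline, not a TestRail-provided client: EvalResult, build_bulk_payload, and run_pipeline are hypothetical names, the evaluation step is a stub, and the sending step is injected as a callable so the sketch runs without network access. It assumes results are submitted together through TestRail's add_results_for_cases endpoint and that the custom field names match those used in the earlier script.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class EvalResult:
    """One evaluation outcome tied to a TestRail case (hypothetical shape)."""
    case_id: int
    prompt: str
    output: str
    passed: bool
    reasoning: str


def build_bulk_payload(results: list[EvalResult]) -> dict:
    """Build a request body in the shape add_results_for_cases expects."""
    return {
        "results": [
            {
                "case_id": r.case_id,
                "status_id": 1 if r.passed else 5,  # default Passed / Failed IDs
                "comment": r.reasoning,
                "custom_ai_input": r.prompt,    # assumed custom field name
                "custom_ai_output": r.output,   # assumed custom field name
            }
            for r in results
        ]
    }


def run_pipeline(
    scenarios: Iterable[tuple[int, str]],
    evaluate: Callable[[int, str], EvalResult],
    send: Callable[[dict], None],
) -> dict:
    """Execute scenarios, collect evaluation results, and push them once."""
    results = [evaluate(case_id, prompt) for case_id, prompt in scenarios]
    payload = build_bulk_payload(results)
    send(payload)
    return payload


if __name__ == "__main__":
    def stub_evaluate(case_id: int, prompt: str) -> EvalResult:
        # Stand-in for a real evaluation tool or LLM judge.
        output = prompt.upper()
        return EvalResult(case_id, prompt, output, bool(output), "stub check")

    # print stands in for an HTTP call so the sketch runs offline.
    run_pipeline([(101, "hello"), (102, "refund policy?")], stub_evaluate, print)
```

In a real integration, send could wrap a requests.post call to the add_results_for_cases endpoint for your run, using the same authentication as the single-result script. Injecting the sender keeps the payload logic testable without a live TestRail instance.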

The result is a more structured approach to AI quality where evaluation results are no longer disconnected from the test cases they relate to, making it easier to measure coverage, track quality over time, and manage AI testing as part of the broader software delivery process.