AI teams today use a growing ecosystem of tools to evaluate the quality of their applications.
Whether you’re using Langfuse, LangSmith, DeepEval, OpenAI Evals, Arize Phoenix, or a custom evaluation pipeline, these tools help generate valuable quality signals about your AI system’s performance.
The challenge is that evaluation results often remain isolated within those platforms.
With TestRail’s AI Evaluation Template and APIs, teams can automatically push evaluation results into TestRail, creating a centralized location for tracking AI quality alongside the rest of their testing and release processes.
Use the tools you already have
TestRail doesn't replace your AI observability or evaluation platform. Instead, it provides a common destination for storing, reviewing, and reporting on evaluation results regardless of where they originate.
The AI Quality Stack
Most organizations already have tooling across multiple layers of their AI stack.
Most AI evaluation platforms focus on evaluating outputs, not managing test cases. While they can score responses, benchmark prompts, and compare models, they typically don’t provide a structured way to organize evaluations into test suites, track coverage, or connect results to broader quality and release processes.
By pushing evaluation results into TestRail, teams can associate AI evaluation outcomes with structured test cases, making it easier to track coverage, compare results over time, and manage AI quality alongside the rest of their testing activities.
What Can Be Sent to TestRail?
Evaluation tools typically generate information such as:
- Prompts and inputs
- AI-generated outputs
- Trace URLs
- Human review scores
- LLM-as-a-Judge scores
- Accuracy metrics
- Safety assessments
- Latency measurements
- Evaluation scores
Using TestRail APIs, this data can be attached directly to test results created with the AI Evaluation Template.
| TestRail Field | Evaluation Data |
|---|---|
| Input | User prompt |
| Output | Model response |
| Traces | Trace URL |
| Latency | Response time |
| Quality Rating | Fully customizable quality categories |
| Comment | Evaluation reasoning |
The following example script uploads evaluation data as a result for a test case that uses the AI Evaluation Template:
import os

import requests

# TestRail connection settings, read from environment variables.
TESTRAIL_URL = os.getenv("TESTRAIL_URL")
TESTRAIL_EMAIL = os.getenv("TESTRAIL_EMAIL")
TESTRAIL_API_KEY = os.getenv("TESTRAIL_API_KEY")
TESTRAIL_PROJECT_ID = os.getenv("TESTRAIL_PROJECT_ID")
TESTRAIL_RUN_ID = os.getenv("TESTRAIL_RUN_ID")  # run that receives the result

# case_id, status_id, trace, latency, notes, prompt, response_output,
# trace_url, and filtered_ratings are assumed to be produced earlier
# by your evaluation tool or pipeline.

# ==========================================
# TESTRAIL RESULT ENDPOINT
# ==========================================
result_endpoint = (
    f"{TESTRAIL_URL}/index.php?/api/v2/"
    f"add_result_for_case/{TESTRAIL_RUN_ID}/{case_id}"
)

# ==========================================
# TESTRAIL PAYLOAD
# ==========================================
payload = {
    "status_id": status_id,
    "comment": (
        "Automated AI Evaluation Result\n\n"
        f"Trace Name: {trace['name']}\n"
        f"Latency: {latency} seconds\n\n"
        f"Notes:\n{notes}"
    ),
    "custom_ai_input": prompt,
    "custom_ai_output": response_output,
    "custom_ai_traces": trace_url,
    "custom_ai_latency": str(latency),
    "quality_rating": filtered_ratings,
}

# ==========================================
# SEND RESULT TO TESTRAIL
# ==========================================
# Authenticate with HTTP basic auth: account email plus API key.
result = requests.post(
    result_endpoint,
    json=payload,
    auth=(TESTRAIL_EMAIL, TESTRAIL_API_KEY),
    headers={"Content-Type": "application/json"},
    timeout=30,
)

print(f"TestRail Status: {result.status_code}")
if result.status_code == 200:
    print("Result successfully added to TestRail")
else:
    print("ERROR sending result")
    print(result.text)
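The script above assumes `status_id` has already been computed. One possible way to derive it, as a minimal sketch, is to threshold a normalized evaluation score. The IDs 1 (Passed), 4 (Retest), and 5 (Failed) are TestRail's default system statuses; the function name `score_to_status_id` and the 0.8 and 0.5 thresholds are hypothetical and should be tuned to your own quality bar.

```python
# TestRail's built-in status IDs (custom statuses, if any, use other IDs).
PASSED, RETEST, FAILED = 1, 4, 5


def score_to_status_id(score: float, pass_at: float = 0.8, review_at: float = 0.5) -> int:
    """Map an evaluation score in [0, 1] to a TestRail status ID.

    The thresholds are illustrative assumptions, not TestRail defaults.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score}")
    if score >= pass_at:
        return PASSED
    if score >= review_at:
        return RETEST  # borderline result: flag it for human review
    return FAILED


status_id = score_to_status_id(0.91)
```

Routing borderline scores to Retest rather than Failed keeps a human reviewer in the loop for ambiguous cases, which fits workflows that mix LLM-as-a-Judge scores with human review.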
Example Workflow
How AI Evaluation Results Flow Into TestRail
AI Application → Evaluation Tool (Langfuse, Promptfoo, DeepEval, etc.) → Human Review or LLM Judge → TestRail (AI Evaluation Results)
Why Integrate AI Evaluations with TestRail?
AI evaluation platforms are excellent at generating scores, traces, and benchmark results, but they typically aren’t designed to manage test cases or broader testing workflows.
By bringing evaluation results into TestRail, teams can connect AI quality signals with the structured testing processes they already use to manage software quality.
Benefits include:
- Associate evaluation results with test cases to create a repeatable and organized AI testing process.
- Track coverage across prompts, scenarios, and use cases rather than evaluating outputs in isolation.
- Combine manual testing, automation, and AI evaluations within the same test suites and test plans.
- Compare models, prompts, agents, and evaluation strategies using a consistent testing framework.
- Maintain a historical record of evaluation outcomes linked to specific tests, releases, and milestones.
- Support both human reviews and LLM-as-a-Judge workflows without creating separate reporting processes.
- Use existing TestRail dashboards and reports to monitor quality trends and identify regressions over time.
Flexible by Design
Because integrations are built using TestRail APIs, teams can connect virtually any AI observability or evaluation platform.
Whether results come from Langfuse traces, Promptfoo benchmarks, DeepEval metrics, OpenAI Evals, human reviewers, or proprietary evaluation systems, the integration pattern remains the same:
- Execute a test scenario.
- Generate evaluation results.
- Push the results into TestRail.
- Track outcomes alongside your existing test cases, test suites, and release activities.
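The steps above can be sketched as a small batch pipeline. This is a sketch under stated assumptions, not a definitive integration: `EvalRecord`, `build_bulk_payload`, and `push_results` are hypothetical names, the `custom_ai_*` field names are taken from the example script and must match the system names of the fields in your own TestRail instance, and the request goes to TestRail's `add_results_for_cases` endpoint so that several results are submitted in one call.

```python
import os
from dataclasses import dataclass


@dataclass
class EvalRecord:
    """One evaluated scenario (hypothetical shape; adapt to your tool's export)."""
    case_id: int
    prompt: str
    output: str
    trace_url: str
    latency_s: float
    passed: bool
    reasoning: str = ""


def build_bulk_payload(records: list[EvalRecord]) -> dict:
    """Convert evaluation records into an add_results_for_cases request body."""
    return {
        "results": [
            {
                "case_id": r.case_id,
                "status_id": 1 if r.passed else 5,  # 1 = Passed, 5 = Failed
                "comment": r.reasoning,
                # Custom field names assumed from the example script.
                "custom_ai_input": r.prompt,
                "custom_ai_output": r.output,
                "custom_ai_traces": r.trace_url,
                "custom_ai_latency": str(r.latency_s),
            }
            for r in records
        ]
    }


def push_results(run_id: str, records: list[EvalRecord]):
    """POST all results for one test run in a single request."""
    # Imported here so the payload builder has no third-party dependency.
    import requests

    url = (
        f"{os.environ['TESTRAIL_URL']}/index.php?/api/v2/"
        f"add_results_for_cases/{run_id}"
    )
    response = requests.post(
        url,
        json=build_bulk_payload(records),
        auth=(os.environ["TESTRAIL_EMAIL"], os.environ["TESTRAIL_API_KEY"]),
        timeout=30,
    )
    response.raise_for_status()  # surface HTTP errors instead of ignoring them
    return response
```

Because only `EvalRecord` depends on the source tool, swapping Langfuse for Promptfoo or a proprietary pipeline means rewriting the code that builds the records, not the code that talks to TestRail.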
The result is a more structured approach to AI quality: evaluation results stay connected to the test cases they relate to. That makes it easier to measure coverage, track quality over time, and manage AI testing as part of the broader software delivery process.