The AI Evaluation template introduces a structured way to test systems where outcomes are not strictly deterministic.
It is designed for evaluating AI-powered features, while also supporting other scenarios where quality must be assessed across multiple dimensions rather than a simple pass or fail.
The template integrates with existing TestRail functionality. Test runs, plans, milestones, reports, integrations, and the API all behave as they do today.
When to use this template
Use the AI Evaluation template when you need to:
Evaluate AI-generated outputs (for example, chatbot responses or recommendations)
Assess behaviour that varies between executions
Measure quality across multiple dimensions (for example, accuracy, relevance, or safety)
For non-AI scenarios, you can reuse the Quality Rating result field with other templates to evaluate:
Performance (for example, perceived responsiveness or degradation under load)
Security and compliance (for example, prompt injection resistance or data leakage)
Any system where quality cannot be reduced to a binary outcome
Creating AI Evaluation test cases
When creating a test case, select AI Evaluation from the Template dropdown.
The template includes all standard TestRail system fields:
Title (required)
Section (required)
Template (required)
Type (required)
Priority (required)
Status (Enterprise only)
Assigned To (Enterprise only)
It also supports structured steps with expected results.
Additional case fields
The template introduces optional fields to describe the system under test:
AI Type (dropdown)
Used to categorise the system (for example, RAG, ML, LLM)
AI Model (dropdown)
Used to identify the model or system version (for example, GPT, Gemini)
These fields can also be used for broader classification in non-AI scenarios.
Logging results
In addition to existing fields (Status, Comment, Defects, etc.), the following fields are available:
Quality Rating (required)
Use this to evaluate the overall quality of the AI output across defined criteria (e.g. accuracy, relevance, completeness). This provides a consistent, qualitative assessment of the result.
Input (optional)
Capture the user or system input provided to the AI (e.g. prompts, parameters, boosts). This helps with reproducibility and debugging.
Output (optional)
Record the AI-generated output (e.g. chatbot response, generated content). This allows for direct comparison against expected behaviour.
Traces (optional)
Provide a URL to trace logs from external observability tools (e.g. Langfuse). This is useful for deeper analysis and troubleshooting.
Latency (optional)
Record the response time of the AI system. This helps monitor performance and identify potential bottlenecks.
All fields behave consistently with existing TestRail result fields.
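As a rough illustration of how these result fields relate to one another, the sketch below models a single logged result as a plain Python record. It is not a TestRail API type: the class name, attribute names, the millisecond unit for latency, and the checks in problems() are all assumptions made for the example. Only the field list and the required/optional split come from the section above.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional
from urllib.parse import urlparse


@dataclass
class EvaluationResult:
    """Hypothetical record of one AI Evaluation result (illustrative only)."""
    quality_rating: Dict[str, int]       # category name -> stars; the only required field
    input_text: Optional[str] = None     # prompt or parameters given to the AI
    output_text: Optional[str] = None    # what the AI produced
    traces_url: Optional[str] = None     # link to a trace in an observability tool
    latency_ms: Optional[float] = None   # response time; milliseconds is an assumption

    def problems(self) -> List[str]:
        """Return readable problems; an empty list means the record looks complete."""
        found = []
        if not self.quality_rating:
            found.append("Quality Rating is required")
        if self.traces_url is not None:
            parsed = urlparse(self.traces_url)
            if parsed.scheme not in ("http", "https") or not parsed.netloc:
                found.append("Traces should be an http(s) URL")
        if self.latency_ms is not None and self.latency_ms < 0:
            found.append("Latency cannot be negative")
        return found
```

A record like this can be handy in automation scripts that collect evaluation data before it is entered as a result, because missing or malformed values surface early.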
Quality Rating
The Quality Rating result field enables structured evaluation using a star-based system.
Each category can be rated from 0 to 5 stars
Categories are configurable in Administration → Customisation → Result Fields
A maximum of 15 categories is supported
At least one category must be rated (≥1 star) before saving a result
Typical categories include:
Factual accuracy
Relevance to user intent
Reasoning coherence
Security and compliance
Response consistency
Categories are fully customisable and can be adapted to different testing needs.
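The rating limits in this section (0 to 5 stars per category, at most 15 categories, and at least one category with 1 or more stars) can be sketched as a small pre-check. This is a minimal sketch for scripts that prepare ratings, not TestRail's own validation. The function name is hypothetical, and treating stars as whole numbers is an assumption.

```python
from typing import Dict, List

MAX_CATEGORIES = 15          # limit stated in this section
MIN_STARS, MAX_STARS = 0, 5  # star range stated in this section


def validate_ratings(ratings: Dict[str, int]) -> List[str]:
    """Return a list of rule violations; an empty list means the ratings pass."""
    errors = []
    if len(ratings) > MAX_CATEGORIES:
        errors.append(f"too many categories: {len(ratings)} > {MAX_CATEGORIES}")
    for category, stars in ratings.items():
        # Assumption: stars are whole numbers (no half stars).
        if not isinstance(stars, int) or not MIN_STARS <= stars <= MAX_STARS:
            errors.append(f"{category}: rating must be a whole number from 0 to 5")
    # A result cannot be saved unless some category has at least 1 star.
    if not any(isinstance(s, int) and s >= 1 for s in ratings.values()):
        errors.append("at least one category needs 1 or more stars")
    return errors
```

For example, validate_ratings({"Factual accuracy": 4, "Relevance to user intent": 0}) returns an empty list, while a result in which every category has 0 stars is rejected.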
Reusing Quality Rating across templates
Quality Rating is not limited to the AI Evaluation template.
Administrators can add this result field to other templates to support consistent evaluation across:
AI testing
Performance testing
Security testing
Any other testing that calls for qualitative insight
When used in other templates, the same rating behaviour and dashboards apply.
Quality Insights dashboard
When a test run includes tests with the Quality Rating field, a Quality Insights section becomes available.
The dashboard includes:
Average Quality Score
Overall average rating across all results (out of 5)
Tests with results
Percentage of tests with a final status
Quality by Category
Average rating per category, based on configured order
Additional behaviour:
Visible only when the Quality Rating field is present in the run
Automatically hidden if no such tests exist
Data refreshes hourly, with manual refresh available
Dashboard is read-only
Export available as PDF
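To make the three dashboard metrics concrete, the sketch below computes them from a small in-memory list of results. The exact formulas TestRail uses are not described here, so the arithmetic is an assumption: each result's score is taken as the mean of its category ratings, the Average Quality Score is the mean of those scores, and a test with no result (None) stands in for a test without a final status. The function and key names are hypothetical.

```python
from statistics import mean
from typing import Dict, List, Optional


def quality_insights(
    tests: List[Optional[Dict[str, int]]],
    category_order: List[str],
) -> dict:
    """Summarise a run; each entry is a test's ratings, or None if it has no result yet."""
    rated = [ratings for ratings in tests if ratings]
    # Assumption: a result's score is the mean of its category ratings.
    per_result = [mean(ratings.values()) for ratings in rated]

    by_category = {}
    for category in category_order:  # keep the configured category order
        values = [ratings[category] for ratings in rated if category in ratings]
        if values:
            by_category[category] = round(mean(values), 2)

    return {
        "average_quality_score": round(mean(per_result), 2) if per_result else None,
        "tests_with_results_pct": round(100 * len(rated) / len(tests), 1) if tests else 0.0,
        "quality_by_category": by_category,
    }
```

With two rated tests and two unrated tests in a run, this sketch reports that 50% of tests have results and averages only the rated ones.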
Filtering and sorting
When the Quality Rating field is present in a test run:
A Category filter is available in Tests & Results
A Quality Rating sort option is available (default: high to low)
A Quality Rating column is displayed
Behaviour:
Filtering by category updates both the test list and displayed ratings
The column shows the average rating per test
Hovering shows ratings for all categories
Controls are hidden if no Rating-type field exists in the run
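The filtering and sorting behaviour described above can be approximated as follows. This is an illustrative sketch rather than TestRail's implementation: the function name is hypothetical, and dropping tests that lack the selected category from the filtered list is an assumption.

```python
from statistics import mean
from typing import Dict, List, Optional, Tuple


def rating_column(
    tests: Dict[str, Dict[str, int]],
    category: Optional[str] = None,
    high_to_low: bool = True,
) -> List[Tuple[str, float]]:
    """Return (test title, displayed rating) rows, sorted by rating."""
    rows = []
    for title, ratings in tests.items():
        if not ratings:
            continue  # nothing to display for an unrated test
        if category is not None:
            if category not in ratings:
                continue  # assumption: the filter removes tests without this category
            shown = float(ratings[category])  # filtered view shows that category's rating
        else:
            shown = float(mean(ratings.values()))  # unfiltered view shows the per-test average
        rows.append((title, round(shown, 2)))
    # The default sort order is high to low.
    return sorted(rows, key=lambda row: row[1], reverse=high_to_low)
```

Calling rating_column(tests) gives the unfiltered column, and passing category="Factual accuracy" narrows both the list and the displayed values to that category.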
Template configuration
The AI Evaluation template is available under:
Administration → Customisation → Templates
By default:
Available to all projects
Not set as the default template
Fully editable
Administrators can:
Rename the template
Restrict it to selected projects
Add or remove supported case fields
Reorder fields
Limitations:
The Quality Rating result field cannot be removed
System field names are reserved
Fields marked as “applies to all templates” are automatically included
Compatibility
The AI Evaluation template is fully additive.
It does not change or impact:
Existing templates
Case and result fields
Test execution workflows
Test runs and plans
Integrations and defect tracking
API behaviour
CLI - coming soon!
TestRail Academy
AI Testing, Reimagined
AI doesn’t give the same answer twice, so testing it like traditional software no longer works. In this course, you’ll learn how to evaluate AI outputs across multiple quality dimensions, use LLM-as-judge to scale your testing, and build a repeatable framework for continuous evaluation. Stop guessing. Start measuring what actually matters.