AI Evaluation Template with Quality Insights Dashboard

The AI Evaluation template introduces a structured way to test systems where outcomes are not strictly deterministic.

It is designed for evaluating AI-powered features, while also supporting other scenarios where quality must be assessed across multiple dimensions rather than a simple pass or fail.

The template integrates with existing TestRail functionality. Test runs, plans, milestones, reports, integrations, and the API all behave as they do today.

When to use this template

Use the AI Evaluation template when you need to:

  • Evaluate AI-generated outputs (for example, chatbot responses or recommendations)

  • Assess behaviour that varies between executions

  • Measure quality across multiple dimensions (for example, accuracy, relevance, or safety)

For non-AI scenarios, you can reuse the Quality Rating result field with other templates to evaluate:

  • Performance (for example, perceived responsiveness or degradation under load)

  • Security and compliance (for example, prompt injection resistance or data leakage)

  • Any system where quality cannot be reduced to a binary outcome

Creating AI Evaluation test cases

When creating a test case, select AI Evaluation from the Template dropdown.

The template includes all standard TestRail system fields:

  • Title (required)

  • Section (required)

  • Template (required)

  • Type (required)

  • Priority (required)

  • Status (Enterprise only)

  • Assigned To (Enterprise only)

It also supports structured steps with expected results.

Additional case fields

The template introduces optional fields to describe the system under test:

  • AI Type (dropdown)
    Used to categorise the system (for example, RAG, ML, LLM)

  • AI Model (dropdown)
    Used to identify the model or system version (for example, GPT, Gemini)

These fields can also be used for broader classification in non-AI scenarios.

Logging results

In addition to the existing result fields (Status, Comment, Defects, and so on), the following fields are available when logging a result:

  • Quality Rating (required)
    Use this to evaluate the overall quality of the AI output across defined criteria (e.g. accuracy, relevance, completeness). This provides a consistent, qualitative assessment of the result.

  • Input (optional)
    Capture the user or system input provided to the AI (e.g. prompts, parameters, boosts). This helps with reproducibility and debugging.

  • Output (optional)
    Record the AI-generated output (e.g. chatbot response, generated content). This allows for direct comparison against expected behaviour.

  • Traces (optional)
    Provide a URL to trace logs from external observability tools (e.g. Langfuse). This is useful for deeper analysis and troubleshooting.

  • Latency (optional)
    Record the response time of the AI system. This helps monitor performance and identify potential bottlenecks.

All fields behave consistently with existing TestRail result fields.
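As a rough illustration of how these result fields might be assembled for an API call, the sketch below builds a request body of the kind that TestRail's existing add_result_for_case endpoint accepts. The status_id and comment keys and the custom_ prefix for custom fields are standard TestRail conventions. The field system names in FIELD_NAMES, the dictionary shape used for the Quality Rating value, and the helper build_result_payload are all assumptions made for illustration; check your own field configuration before relying on them.

```python
import json

# Hypothetical system names. TestRail exposes custom fields with a "custom_"
# prefix, but the real names for these fields depend on your configuration.
FIELD_NAMES = {
    "quality_rating": "custom_quality_rating",
    "input": "custom_input",
    "output": "custom_output",
    "traces": "custom_traces",
    "latency": "custom_latency",
}


def build_result_payload(status_id, ratings, comment=None, **optional):
    """Build a JSON-serialisable result body (the rating value shape is assumed)."""
    if not ratings:
        raise ValueError("Quality Rating is required")
    payload = {"status_id": status_id, FIELD_NAMES["quality_rating"]: dict(ratings)}
    if comment is not None:
        payload["comment"] = comment
    for key, value in optional.items():
        if key not in FIELD_NAMES or key == "quality_rating":
            raise KeyError(f"unknown result field: {key}")
        if value is not None:  # optional fields are omitted when unset
            payload[FIELD_NAMES[key]] = value
    return payload


payload = build_result_payload(
    status_id=1,  # 1 is the built-in Passed status
    ratings={"Factual accuracy": 4, "Relevance to user intent": 5},
    comment="Answer was correct but verbose.",
    input="What is the refund policy?",
    output="Refunds are available within 30 days.",
    traces="https://example.com/trace/abc123",  # placeholder trace URL
    latency=1.8,  # unit depends on how the Latency field is configured
)
print(json.dumps(payload, indent=2))
```

Only the required Quality Rating and the status are mandatory in this sketch; any optional field left out is simply absent from the body.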


Quality Rating

The Quality Rating result field enables structured evaluation using a star-based system.

  • Each category can be rated from 0 to 5 stars

  • Categories are configurable in Administration → Customisation → Result Fields

  • A maximum of 15 categories is supported

  • At least one category must be given a rating of 1 star or more before a result can be saved

Typical categories include:

  • Factual accuracy

  • Relevance to user intent

  • Reasoning coherence

  • Security and compliance

  • Response consistency

Categories are fully customisable and can be adapted to different testing needs.
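The rating rules above (0 to 5 stars per category, at most 15 categories, at least one category with 1 star or more) can be expressed as a small pre-submission check. This is a minimal sketch, not TestRail's own validation: the function validate_ratings is hypothetical, and it assumes ratings are whole stars, which the rules above do not state explicitly.

```python
MAX_CATEGORIES = 15
MIN_STARS, MAX_STARS = 0, 5


def validate_ratings(ratings, configured_categories):
    """Return a list of problems; an empty list means the result could be saved."""
    problems = []
    if len(configured_categories) > MAX_CATEGORIES:
        problems.append(f"more than {MAX_CATEGORIES} categories configured")
    for category, stars in ratings.items():
        if category not in configured_categories:
            problems.append(f"unknown category: {category}")
        # Assumption: ratings are whole stars only.
        elif not isinstance(stars, int) or not MIN_STARS <= stars <= MAX_STARS:
            problems.append(f"{category}: rating must be a whole number from 0 to 5")
    if not any(isinstance(s, int) and s >= 1 for s in ratings.values()):
        problems.append("at least one category needs 1 star or more")
    return problems


categories = ["Factual accuracy", "Relevance to user intent", "Reasoning coherence"]
print(validate_ratings({"Factual accuracy": 4, "Reasoning coherence": 0}, categories))
print(validate_ratings({"Factual accuracy": 0}, categories))
```

A check like this is useful mainly for automated pipelines that log results in bulk, where catching an invalid rating before submission is cheaper than handling a rejected result afterwards.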

Reusing Quality Rating across templates

Quality Rating is not limited to the AI Evaluation template.

Administrators can add this result field to other templates to support consistent evaluation across:

  • AI testing

  • Performance testing

  • Security testing

  • Any other testing that calls for qualitative insight

When used in other templates, the same rating behaviour and dashboards apply.

Quality Insights dashboard

When a test run includes tests with the Quality Rating field, a Quality Insights section becomes available.

The dashboard includes:

  • Average Quality Score
    Overall average rating across all results (out of 5)

  • Tests with results
    Percentage of tests with a final status

  • Quality by Category
    Average rating per category, based on configured order

Additional behaviour:

  • Visible only when the Quality Rating field is present in the run

  • Automatically hidden if no such tests exist

  • Data refreshes hourly, with manual refresh available

  • Dashboard is read-only

  • Export available as PDF
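To make the three dashboard metrics concrete, the sketch below computes them from a handful of sample tests. It rests on several assumptions that this article does not confirm: the Average Quality Score is taken as the mean of every individual category rating, categories left unrated on a result are excluded rather than counted as 0, and a "final status" is treated as any status other than Untested. The data and the function quality_insights are illustrative only.

```python
from statistics import mean

# Configured category order (sample values).
CATEGORIES = ["Factual accuracy", "Relevance to user intent", "Response consistency"]

tests = [
    {"status": "Passed", "ratings": {"Factual accuracy": 4, "Relevance to user intent": 5}},
    {"status": "Failed", "ratings": {"Factual accuracy": 2, "Response consistency": 3}},
    {"status": "Untested", "ratings": {}},
    {"status": "Passed", "ratings": {"Factual accuracy": 5, "Relevance to user intent": 4}},
]


def quality_insights(tests, categories):
    # Assumption: the overall score averages every individual category rating.
    all_stars = [s for t in tests for s in t["ratings"].values()]
    # Assumption: any status other than Untested counts as a final status.
    with_results = [t for t in tests if t["status"] != "Untested"]
    by_category = {}
    for category in categories:  # dicts keep insertion order, so configured order is preserved
        stars = [t["ratings"][category] for t in tests if category in t["ratings"]]
        by_category[category] = round(mean(stars), 2) if stars else None
    return {
        "average_quality_score": round(mean(all_stars), 2) if all_stars else None,
        "tests_with_results_pct": round(100 * len(with_results) / len(tests), 1) if tests else 0.0,
        "quality_by_category": by_category,
    }


insights = quality_insights(tests, CATEGORIES)
print(insights)
```

If the dashboard instead averages per-test averages, or counts unrated categories as zero, the numbers will differ, so treat this as a way to reason about the metrics rather than a way to reproduce them exactly.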

Filtering and sorting

When the Quality Rating field is present in a test run:

  • Category filter is available in Tests & Results

  • Quality Rating sort option is available (default: high to low)

  • Quality Rating column is displayed

Behaviour:

  • Filtering by category updates both the test list and displayed ratings

  • The column shows the average rating per test

  • Hovering shows ratings for all categories

  • Controls are hidden if no Rating-type field exists in the run
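The filtering and sorting behaviour described above can be approximated as follows. This is a sketch under stated assumptions: with no category filter, the value shown is the test's average across its rated categories; with a filter, the test list is narrowed to tests rated in that category and the value shown is that category's rating. Tests with no rating are left out for brevity. The names filter_and_sort and displayed_rating are hypothetical.

```python
from statistics import mean

tests = [
    {"title": "Refund policy answer", "ratings": {"Factual accuracy": 4, "Relevance to user intent": 5}},
    {"title": "Shipping estimate", "ratings": {"Factual accuracy": 2, "Response consistency": 3}},
    {"title": "Product recommendation", "ratings": {"Factual accuracy": 5, "Relevance to user intent": 3}},
]


def displayed_rating(ratings, category=None):
    """Average stars for a test, or one category's stars when a filter is active."""
    if category is not None:
        return ratings.get(category)  # None when the test was not rated in this category
    return mean(ratings.values()) if ratings else None


def filter_and_sort(tests, category=None, high_to_low=True):
    rows = []
    for test in tests:
        shown = displayed_rating(test["ratings"], category)
        if shown is not None:  # assumption: unrated tests drop out of the list
            rows.append((test["title"], shown))
    # Default order is high to low, matching the default sort option.
    return sorted(rows, key=lambda row: row[1], reverse=high_to_low)


print(filter_and_sort(tests))
print(filter_and_sort(tests, category="Relevance to user intent"))
```

Passing high_to_low=False reverses the order, which is handy when you want the weakest results at the top of a review list.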

Template configuration

The AI Evaluation template is available under:

Administration → Customisation → Templates

By default:

  • Available to all projects

  • Not set as the default template

  • Fully editable

Administrators can:

  • Rename the template

  • Restrict it to selected projects

  • Add or remove supported case fields

  • Reorder fields

Limitations:

  • The Quality Rating result field cannot be removed

  • System field names are reserved

  • Fields marked as “applies to all templates” are automatically included

Compatibility

The AI Evaluation template is fully additive.

It does not change or impact:

  • Existing templates

  • Case and result fields

  • Test execution workflows

  • Test runs and plans

  • Integrations and defect tracking

  • API behaviour

  • CLI (support coming soon)

TestRail Academy

AI Testing, Reimagined

AI doesn’t give the same answer twice, so testing it like traditional software no longer works. In this course, you’ll learn how to evaluate AI outputs across multiple quality dimensions, use LLM-as-judge to scale your testing, and build a repeatable framework for continuous evaluation. Stop guessing. Start measuring what actually matters.

Open Academy Course