Consistency as a Testable Property: Statistical Methods to Evaluate AI Agent Reliability

Harsh Raj, Niranjan Krishna, Suvrorup Mukherjee, Aritra Guha, Cheryl Flynn and Subhabrata Majumdar

2026

Working Paper No. 741

Abstract

This paper establishes a rigorous measurement science for AI agent reliability, providing a foundational framework for quantifying consistency under semantically preserving perturbations. By leveraging U-statistics for output-level reliability and kernel-based metrics for trajectory-level stability, we offer a principled approach to evaluating agents across diverse operating conditions. Our proposal highlights the important distinction between the core capability and execution robustness of an agent, showing that minor task-level variations can induce complete strategy breakdowns despite the agent possessing the requisite knowledge for the task. We validate our framework through extensive experiments on three agentic benchmarks, demonstrating that trajectory-level consistency metrics provide far greater diagnostic sensitivity than traditional pass@1 rates. By providing the mathematical tools to isolate where and why agents deviate, we enable the identification and rectification of architectural concerns that hinder the deployment of agents in high-stakes, real-world environments.
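To make the two families of statistics named in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes that output-level consistency is scored as an order-2 U-statistic over pairwise agreement between an agent's answers to perturbed versions of one task, and that trajectory-level stability is scored with the standard unbiased estimator of squared Maximum Mean Discrepancy (MMD). The function names, the exact-match agreement kernel, and the Gaussian kernel on action-count vectors are all hypothetical choices made for illustration; the paper's own kernels and estimators may differ.

```python
import math
from collections import Counter
from itertools import combinations


def pairwise_consistency(outputs, agree=lambda a, b: float(a == b)):
    """Order-2 U-statistic: the mean of a symmetric kernel over all unordered pairs.

    `agree` is an assumed agreement kernel (exact match by default).
    """
    pairs = list(combinations(outputs, 2))
    if not pairs:
        raise ValueError("need at least two outputs")
    return sum(agree(a, b) for a, b in pairs) / len(pairs)


def rbf_on_action_counts(traj_a, traj_b, gamma=0.5):
    """Assumed trajectory kernel: Gaussian RBF on bag-of-actions count vectors."""
    counts_a, counts_b = Counter(traj_a), Counter(traj_b)
    actions = set(counts_a) | set(counts_b)
    sq_dist = sum((counts_a[x] - counts_b[x]) ** 2 for x in actions)
    return math.exp(-gamma * sq_dist)


def mmd2_unbiased(xs, ys, kernel=rbf_on_action_counts):
    """Unbiased estimate of squared MMD between two samples of trajectories."""
    m, n = len(xs), len(ys)
    if m < 2 or n < 2:
        raise ValueError("need at least two trajectories per sample")
    # Within-sample terms exclude the diagonal (i == j) to stay unbiased.
    k_xx = sum(kernel(xs[i], xs[j]) for i in range(m) for j in range(m) if i != j)
    k_yy = sum(kernel(ys[i], ys[j]) for i in range(n) for j in range(n) if i != j)
    k_xy = sum(kernel(x, y) for x in xs for y in ys)
    return k_xx / (m * (m - 1)) + k_yy / (n * (n - 1)) - 2 * k_xy / (m * n)


# Hypothetical data: final answers under three paraphrases of the same task.
answers = ["Paris", "Paris", "Lyon"]
consistency = pairwise_consistency(answers)

# Hypothetical data: action sequences on the original vs. perturbed task.
original_runs = [["search", "click"], ["search", "click"]]
perturbed_runs = [["search", "search", "search"], ["search", "search", "search"]]
shift = mmd2_unbiased(original_runs, perturbed_runs)
```

In this toy data only one of the three answer pairs agrees, so the consistency score is 1/3. The MMD estimate is zero when the two trajectory samples are identical and grows as the perturbed runs use a different mix of actions. A count-based kernel ignores action order, so an order-sensitive kernel would be needed to detect reordered but otherwise identical strategies.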

Key words

AI agent evaluation, consistency testing, U-statistics, Maximum Mean Discrepancy, perturbation robustness, kernel methods

WP No. 741.pdf (1.1 MB)
