Centres Of Excellence

To focus on new and emerging areas of research and education, Centres of Excellence have been established within the Institute. These ‘virtual' centres draw on resources from its stakeholders, and interact with them to enhance core competencies

Read More >>

Faculty

Faculty members at IIMB generate knowledge through cutting-edge research in all functional areas of management that would benefit public and private sector companies, and government and society in general.

Read More >>

IIMB Management Review

Journal of Indian Institute of Management Bangalore

IIM Bangalore offers Degree-Granting Programmes, a Diploma Programme, Certificate Programmes and Executive Education Programmes and specialised courses in areas such as entrepreneurship and public policy.

Read More >>

About IIMB

The Indian Institute of Management Bangalore (IIMB) believes in building leaders through holistic, transformative and innovative education

Read More >>

Consistency as a Testable Property: Statistical Methods to Evaluate AI Agent Reliability

Harsh Raj, Niranjan Krishna, Suvrorup Mukherjee, Aritra Guha, Cheryl Flynn and Subhabrata Majumdar
2026
Working Paper No
741
Body

This paper establishes a rigorous measurement science for AI agent reliability, providing a foundational framework for quantifying consistency under semantically preserving perturbations. By leveraging U-statistics for output-level reliability and kernel-based metrics for trajectory-level stability, we offer a principled approach to evaluating agents across diverse operating conditions. Our proposal highlights the important distinction between the core capability and execution robustness of an agent, showing that minor task-level variations can induce complete strategy breakdowns despite the agent possessing the requisite knowledge for the task. We validate our framework through extensive experiments on three agentic benchmarks, demonstrating that trajectory-level consistency metrics provide far greater diagnostic sensitivity than traditional pass@1 rates. By providing the mathematical tools to isolate where and why agents deviate, we enable the identification and rectification of architectural concerns that hinder the deployment of agents in high-stakes, real-world environments.

Key words
AI agent evaluation, consistency testing, U-statistics, Maximum Mean Discrepancy, perturbation robustness, kernel methods
WP No. 741.pdf (1.1 MB)

Consistency as a Testable Property: Statistical Methods to Evaluate AI Agent Reliability

Author(s) Name: Harsh Raj, Niranjan Krishna, Suvrorup Mukherjee, Aritra Guha, Cheryl Flynn and Subhabrata Majumdar, 2026
Working Paper No : 741
Abstract:

This paper establishes a rigorous measurement science for AI agent reliability, providing a foundational framework for quantifying consistency under semantically preserving perturbations. By leveraging U-statistics for output-level reliability and kernel-based metrics for trajectory-level stability, we offer a principled approach to evaluating agents across diverse operating conditions. Our proposal highlights the important distinction between the core capability and execution robustness of an agent, showing that minor task-level variations can induce complete strategy breakdowns despite the agent possessing the requisite knowledge for the task. We validate our framework through extensive experiments on three agentic benchmarks, demonstrating that trajectory-level consistency metrics provide far greater diagnostic sensitivity than traditional pass@1 rates. By providing the mathematical tools to isolate where and why agents deviate, we enable the identification and rectification of architectural concerns that hinder the deployment of agents in high-stakes, real-world environments.

Keywords: AI agent evaluation, consistency testing, U-statistics, Maximum Mean Discrepancy, perturbation robustness, kernel methods
WP No. 741.pdf (1.1 MB)