## Fork Notice

This repository is a modified fork of the original tau2-bench, adapted to run experiments with the Cleanlab Trustworthy Language Model (TLM). It also includes the experiment log files used in the blog post "Automated Hallucination Correction for AI Agents: A Case Study on Tau²-Bench."
This repository requires a setup that differs from the upstream project. Follow the instructions below to get started, keeping these caveats in mind:

- Instead of running `tau2 ...`, you must run `python src/cli.py ...`
- After you install packages, you must uninstall the `tau2` package with pip (`pip uninstall tau2`)
- Add a `.env` file with `OPENAI_API_KEY` and `CLEANLAB_TLM_API_KEY` (see the sanity-check sketch below)
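A quick way to confirm both keys are visible before kicking off an experiment, assuming you load the `.env` file with `python-dotenv` (this fork may load it differently):

```python
# check_env.py -- quick sanity check that the required API keys are set.
# Assumes python-dotenv is installed; the repo itself may load .env differently.
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current directory into os.environ

for key in ("OPENAI_API_KEY", "CLEANLAB_TLM_API_KEY"):
    if os.environ.get(key):
        print(f"{key}: set")
    else:
        print(f"{key}: MISSING -- add it to your .env file")
```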
## Domains

Each domain specifies:

- a policy that the agent must follow
- a set of tools that the agent can use
- a set of tasks to evaluate the agent's performance
- optionally, a set of tools that the user simulator can use

The available domains are `mock`, `airline`, `retail`, and `telecom`.

All the information that an agent developer needs to build an agent for a domain can be accessed through the domain's API docs. See "Domain documentation" below for more details.
## Installation

- Clone the repository:

  ```bash
  git clone https://github.com/sierra-research/tau2-bench
  cd tau2-bench
  ```

- Create a new virtual environment (optional):

  ```bash
  python -m venv .venv
  source .venv/bin/activate
  ```

- Install tau2:

  ```bash
  pip install .
  ```

This will enable you to run the `tau2` command.
To remove all the generated files and the virtual environment, run:

```bash
make clean
```

## API keys

We use LiteLLM to manage LLM APIs, so you can use any LLM provider supported by LiteLLM. To provide your API keys, copy `.env.example` to `.env` and edit it to include your API keys.
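Since all model calls are routed through LiteLLM, a one-off completion is an easy way to confirm a provider key works. A minimal sketch (the model string is just an example; any LiteLLM-supported model works):

```python
# litellm_check.py -- verify your provider key works with one completion call.
# litellm reads provider keys (e.g., OPENAI_API_KEY) from the environment.
import litellm

response = litellm.completion(
    model="gpt-4.1",  # any LiteLLM-supported model string works here
    messages=[{"role": "user", "content": "Say hello in one word."}],
)
print(response.choices[0].message.content)
```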
## Quick start

To run a test evaluation on only 5 tasks with 1 trial per task, run:

```bash
tau2 run \
  --domain airline \
  --agent-llm gpt-4.1 \
  --user-llm gpt-4.1 \
  --num-trials 1 \
  --num-tasks 5
```

Results will be saved in `data/tau2/simulations/`.
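To find the run you just produced, you can look for the newest file in the output directory. A small sketch; the assumption that results are stored as top-level JSON files is ours, so adjust the pattern to whatever the run actually wrote:

```python
# latest_sim.py -- locate the most recently written simulation results file.
from pathlib import Path

sim_dir = Path("data/tau2/simulations")
# Assumes results are stored as JSON files; adjust the glob if the layout differs.
files = sorted(sim_dir.glob("*.json"), key=lambda p: p.stat().st_mtime)
if files:
    print(f"Most recent simulation: {files[-1]}")
else:
    print(f"No simulation files found in {sim_dir}")
```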
## Command-line interface

The `tau2` command provides a unified interface for all functionality:

```bash
tau2 run \
  --domain <domain> \
  --agent <agent_name> \
  --agent-llm <llm_name> \
  --user-llm <llm_name> \
  --num-trials <trial_count> \
  --task-ids <task_ids> \
  --max-concurrency <concurrent_sims> \
  ...
```

The agent name for the TLM pipeline is `tlm_agent`.
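Combining the fork's caveats with this interface, a TLM evaluation is launched through `python src/cli.py` rather than the `tau2` entry point. A sketch of driving such a run from Python (the option values are illustrative):

```python
# run_tlm.py -- launch a TLM-agent evaluation via this fork's CLI entry point.
import subprocess

subprocess.run(
    [
        "python", "src/cli.py", "run",
        "--domain", "airline",
        "--agent", "tlm_agent",   # the TLM pipeline agent
        "--agent-llm", "gpt-4.1",
        "--user-llm", "gpt-4.1",
        "--num-trials", "1",
        "--num-tasks", "5",
    ],
    check=True,  # raise if the run exits with a non-zero status
)
```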
## Results viewer

```bash
tau2 view
```

If you want to view our previous results, first move `data/tau2/results/report/tlm_commenting.json` to `data/simulations/tlm_commenting.json`. Note that tau2's viewer does not show trustworthiness scores, but they are in the JSON file if you want to inspect them (see the extraction sketch after the list below).
This tool allows you to:

- Browse simulation files (in `data/tau2/simulations/`)
- View agent performance metrics
- View a particular simulation
- View task details
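To pull out the trustworthiness scores the viewer omits, you can scan the JSON directly. The sketch below avoids assuming a schema and simply searches for any key containing "trustworthiness"; the actual field names in `tlm_commenting.json` may differ:

```python
# scores.py -- scan a results file for TLM trustworthiness scores.
# No schema is assumed; we search the JSON for any key naming a score.
import json

with open("data/tau2/results/report/tlm_commenting.json") as f:
    data = json.load(f)

def walk(node, path=""):
    """Recursively search the JSON for any key mentioning 'trustworthiness'."""
    if isinstance(node, dict):
        for key, value in node.items():
            if "trustworthiness" in key.lower():
                print(f"{path}/{key}: {value}")
            walk(value, f"{path}/{key}")
    elif isinstance(node, list):
        for i, item in enumerate(node):
            walk(item, f"{path}[{i}]")

walk(data)
```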
## Domain documentation

```bash
tau2 domain <domain>
```

Visit http://127.0.0.1:8004/redoc to see the domain policy and API documentation.
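The `/redoc` page is usually backed by a machine-readable OpenAPI schema. Assuming the docs server follows the standard FastAPI layout (an assumption on our part; check the server's routes if this 404s), you can fetch the raw schema programmatically:

```python
# fetch_schema.py -- pull the domain's OpenAPI schema from the docs server.
# Assumes the standard FastAPI layout where /redoc is backed by /openapi.json.
import requests

response = requests.get("http://127.0.0.1:8004/openapi.json", timeout=10)
response.raise_for_status()
schema = response.json()

# Print the endpoints the domain environment exposes.
for path in sorted(schema.get("paths", {})):
    print(path)
```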
## Telecom domain ablation studies

The telecom domain enables running ablation studies:

- Running an LLM in `no-user` mode. In this mode, the LLM is given all the tools and the information upfront. Choose `llm_agent_solo` as the agent and `dummy_user` as the user.

  ```bash
  tau2 run \
    --domain telecom \
    --agent llm_agent_solo \
    --agent-llm gpt-4.1 \
    --user dummy_user \
    ...
  ```

- Running an LLM in `oracle-plan` mode. In this mode, the LLM is given an oracle plan ahead of time, alleviating the need for action planning. Choose `llm_agent_gt` as the agent.

  ```bash
  tau2 run \
    --domain telecom \
    --agent llm_agent_gt \
    --agent-llm gpt-4.1 \
    --user-llm gpt-4.1 \
    ...
  ```
To test the impact of policy format, we provide an additional "workflow" policy for the telecom domain. To run using this policy, use the `telecom-workflow` domain:

```bash
tau2 run \
  --domain telecom-workflow \
  --agent-llm gpt-4.1 \
  --user-llm gpt-4.1 \
  ...
```

For all the details, see the domains README.
- Code is located in `src/tau2/domains/`
- Data is located in `data/tau2/domains/` (see the listing sketch below)
- Each domain has its own configuration and task definitions
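To see exactly what a domain ships with, you can walk its data directory. A small sketch that makes no assumptions about specific filenames:

```python
# list_domain_data.py -- show the files that define each domain.
from pathlib import Path

domains_dir = Path("data/tau2/domains")
for domain in sorted(p for p in domains_dir.iterdir() if p.is_dir()):
    print(domain.name)
    for item in sorted(domain.rglob("*")):
        if item.is_file():
            print(f"  {item.relative_to(domain)}")
```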
Run the following command to see the domain policy and API documentation:

```bash
tau2 env <domain>
```

Then visit http://127.0.0.1:8004/redoc.
## Environment CLI

An interactive command-line interface for directly querying and testing domain environments. Features:

- Interactive query interface with domain-specific tools
- Support for multiple domains (airline, mock, etc.)
- Session management with history

To use:

```bash
make env-cli
```

Available commands:

- `:q` - quit the program
- `:d` - change domain
- `:n` - start new session (clears history)
Example usage:

```
$ make env-cli
Welcome to the Environment CLI!
Connected to airline domain.
Query (:n new session, :d change domain, :q quit)> What flights are available from SF to LA tomorrow?
Assistant: Let me check the flight availability for you...
[Flight details will appear here]
```

The Environment CLI is useful for:

- Testing domain tools and queries
- Debugging environment responses
- Exploring available domain functionality
- Quick domain interaction without starting the full server stack
## Testing

To run the test suite, use:

```bash
make test
```

## Configuration

To configure the framework, see the config file.
LLM call caching is disabled by default. To enable it:

- Make sure redis is running
- Update the redis config in `config.py` if necessary
- Set `LLM_CACHE_ENABLED` to `True` in `config.py`
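Before setting `LLM_CACHE_ENABLED` to `True`, it is worth confirming that redis is actually reachable. A quick check with the `redis` Python client (host and port below are the client defaults; match them to your `config.py`):

```python
# redis_check.py -- confirm a redis server is reachable before enabling caching.
import redis

client = redis.Redis(host="localhost", port=6379)  # defaults; match config.py
try:
    client.ping()
    print("redis is up -- safe to set LLM_CACHE_ENABLED = True")
except redis.exceptions.ConnectionError:
    print("redis is not reachable -- start it before enabling the LLM cache")
```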
For local or remote agent evaluation, see our agent developer guide.
## Orchestration

The orchestrator coordinates message passing between the agent, the user simulator, and the environment, as shown below (a Python sketch of the same loop follows the diagram):

```mermaid
sequenceDiagram
participant O as Orchestrator
participant A as Agent
participant U as UserSimulator
participant E as Environment
Note over O: Initialize(task)
rect rgb(100, 150, 150)
O->>A: get_init_state_info(message_history)
A->>O: agent_state_info
O->>U: get_init_state_info(message_history)
U->>O: user_state_info
O->>E: set_state(initialization_data, initialization_actions, message_history)
end
Note over O: Start simulation
loop Pass messages between Agent, User, and Environment
alt Agent/Env to User
rect rgb(200, 150, 150)
O->>U: generate_next_message(msg, user_state_info)
U-->>O: (user_msg, user_state_info)
end
Note over O: Check if user_msg is STOP
else User/Env to Agent
rect rgb(100, 200, 100)
O->>A: generate_next_message(msg, agent_state_info)
A-->>O: (assistant_msg, agent_state_info)
Note over O: Check if too many errors
end
else User/Agent to Environment
rect rgb(150, 150, 200)
O->>E: get_response(tool_call)
E-->>O: tool_message
end
end
Note over O: Check if max turns reached.
end
Note over O: Return simulation run
```
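The same control flow, reduced to a Python sketch. The method names mirror the diagram; everything else (the stop conditions, turn accounting, and the message type) is illustrative and not the repo's actual implementation:

```python
# orchestrator_sketch.py -- an illustrative reduction of the diagram above.
# Agent, UserSimulator, and Environment stand in for the repo's real classes;
# msg.is_stop() and msg.tool_calls are hypothetical message attributes.

def run_simulation(agent, user, env, task, max_turns=50):
    # Initialize(task): participants derive their state from the message history.
    agent_state = agent.get_init_state_info(task.message_history)
    user_state = user.get_init_state_info(task.message_history)
    env.set_state(task.initialization_data,
                  task.initialization_actions,
                  task.message_history)

    msg = None
    for _ in range(max_turns):  # check if max turns reached
        # Agent/Env to User: the user simulator produces the next message.
        msg, user_state = user.generate_next_message(msg, user_state)
        if msg.is_stop():  # check if user_msg is STOP
            break

        # User/Env to Agent: the agent replies, possibly requesting tool calls.
        msg, agent_state = agent.generate_next_message(msg, agent_state)

        # User/Agent to Environment: each tool call is executed in turn.
        for tool_call in getattr(msg, "tool_calls", []) or []:
            msg = env.get_response(tool_call)

    return msg  # the simulation run, simplified here to its final message
```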
## Citation

```bibtex
@misc{barres2025tau2,
      title={$\tau^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment},
      author={Victor Barres and Honghua Dong and Soham Ray and Xujie Si and Karthik Narasimhan},
      year={2025},
      eprint={2506.07982},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2506.07982},
}
```

