The Future of Customer Interactions: Evaluating AI Agents with TAU-bench

AI agents are becoming increasingly popular, but how well do they perform in real-world settings? Sierra's new benchmark, TAU-bench, is designed to evaluate how well conversational AI agents handle complex, real-world tasks.

AI Agents: The Future of Customer Interactions?

The world of artificial intelligence (AI) is rapidly evolving, and one area gaining significant attention is the development of AI agents. These agents are designed to interact with humans in a natural, conversational manner, and they have the potential to transform how we interact with technology.

One company that is at the forefront of this technology is Sierra, a customer experience AI startup founded by OpenAI board member Bret Taylor and Google AR/VR veteran Clay Bavor. Sierra has developed a new benchmark called TAU-bench, which is designed to evaluate the performance of conversational AI agents in real-world settings.

AI-generated image depicting a complex conversation taking place on a smartphone.

TAU-bench takes a novel approach to evaluating AI agents: it measures an agent's ability to complete complex tasks across multiple exchanges with a user. This contrasts with traditional benchmarks, which typically score an agent's response to a single question or task.
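To make the multi-turn setup concrete, here is a minimal sketch of the kind of agent-user evaluation loop such a benchmark runs. Everything in it (`ScriptedUser`, `echo_agent`, the `(role, content)` history format) is a hypothetical illustration, not TAU-bench's actual code:

```python
from dataclasses import dataclass

# Hypothetical stand-ins for illustration only; TAU-bench's real interfaces may differ.

@dataclass
class ScriptedUser:
    """A simulated user that reveals its request over several turns."""
    script: list          # messages the user will send, in order
    step: int = 0

    def next_message(self):
        msg = self.script[self.step]
        self.step += 1
        return msg

    def done(self):
        return self.step >= len(self.script)

def echo_agent(history):
    """Placeholder agent: a real benchmark would call an LLM with tool access here."""
    return f"Acknowledged: {history[-1][1]}"

def evaluate_multi_turn(agent, user, max_turns=20):
    """Drive an agent-user dialogue turn by turn until the user is done
    or the turn budget runs out; return the full transcript for grading."""
    history = []                      # list of (role, content) pairs
    for _ in range(max_turns):
        if user.done():
            break
        history.append(("user", user.next_message()))
        history.append(("agent", agent(history)))
    return history

transcript = evaluate_multi_turn(
    echo_agent,
    ScriptedUser(["I need to change my flight.", "Reservation R1001, please.", "Thanks!"]),
)
for role, text in transcript:
    print(f"{role}: {text}")
```

The key difference from a single-turn benchmark is that the agent's score depends on the whole dialogue, not on any one reply in isolation.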

According to Karthik Narasimhan, Sierra’s head of research, TAU-bench is designed to provide a more realistic evaluation of AI agents. “At Sierra, our experience in enabling real-world user-facing conversational agents has made one thing extremely clear: a robust measurement of agent performance and reliability is critical to their successful deployment,” he said.

TAU-bench consists of several tasks that agents must complete, including working with realistic databases and tool APIs, following complex policies and rules, and communicating in realistic conversations. The benchmark is designed to be modular, making it easy to add new elements such as domains, database entries, rules, APIs, tasks, and evaluation metrics.
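Sierra has not published a single fixed schema in this article, but a modular design like the one described is easy to picture in Python. The names below (`Domain`, `Task`, `change_seat`) are illustrative assumptions, not TAU-bench's real API; the point is that domains, database entries, rules, tool APIs, and tasks are all plain, swappable pieces:

```python
from dataclasses import dataclass, field

# Hypothetical schema sketch; field names are illustrative, not TAU-bench's actual API.

@dataclass
class Task:
    instruction: str                 # what the simulated user wants
    expected_state: dict             # database state that counts as success

@dataclass
class Domain:
    name: str
    database: dict                   # realistic records the agent can read and write
    policy: list                     # rules the agent must follow while acting
    tools: dict                      # name -> callable tool APIs exposed to the agent
    tasks: list = field(default_factory=list)

def change_seat(database, reservation_id, seat):
    """Example tool API: mutate the shared database like a real backend would."""
    database["reservations"][reservation_id]["seat"] = seat

airline = Domain(
    name="airline",
    database={"reservations": {"R1001": {"passenger": "A. Rivera", "seat": "22B"}}},
    policy=["Seat changes are free only within the same cabin class."],
    tools={"change_seat": change_seat},
)

# Adding a new task (or rule, tool, or database entry) is a one-line extension:
airline.tasks.append(Task(
    instruction="Move reservation R1001 to aisle seat 14C.",
    expected_state={"reservations": {"R1001": {"passenger": "A. Rivera", "seat": "14C"}}},
))
```

Grading against the final database state, rather than the chat transcript alone, is what lets a benchmark like this check whether the agent actually followed the rules.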

Example of an airline reservation agent in Sierra’s TAU-bench.

Sierra tested TAU-bench using 12 popular large language models (LLMs) from OpenAI, Anthropic, Google, and Mistral. The results showed that all of the agents had difficulty solving tasks, with even the best-performing agent, OpenAI's GPT-4o, achieving less than a 50% average success rate across two domains.
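For a sense of how such a headline number is aggregated, here is one plausible way to compute an average success rate across domains. The outcomes below are made up for illustration and are not Sierra's published data:

```python
def average_success_rate(outcomes_by_domain):
    """Per-domain success rate, then an unweighted mean across domains."""
    per_domain = {
        domain: sum(outcomes) / len(outcomes)    # fraction of tasks solved
        for domain, outcomes in outcomes_by_domain.items()
    }
    overall = sum(per_domain.values()) / len(per_domain)
    return per_domain, overall

# Illustrative outcomes only (1 = task solved, 0 = failed); not Sierra's figures.
per_domain, overall = average_success_rate({
    "retail":  [1, 0, 1, 1, 0, 0, 1, 0],
    "airline": [0, 1, 0, 0, 1, 0, 0, 0],
})
print(per_domain)                  # {'retail': 0.5, 'airline': 0.25}
print(f"average = {overall:.2f}")  # average = 0.38, i.e. below 50%
```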

Chart outlining how 12 popular LLMs performed under TAU-bench.

The results of the TAU-bench evaluation highlight the need for more advanced LLMs that can reason and plan more effectively. They also underscore the importance of developing more complex scenarios to test the abilities of AI agents.

In related news, you.com, a California-based AI firm, is reportedly seeking to raise $50 million in new capital to boost its AI assistants. The company, which has already answered over one billion queries, is looking to expand its capabilities in the increasingly competitive AI market.

The Global Telco AI Alliance, a joint venture between SK Telecom, Deutsche Telekom, e&, Singtel, and SoftBank Corp., is also working on developing multilingual large language models (Telco LLMs) tailored to the telecommunications industry’s needs. The alliance aims to develop AI applications that will enhance customer interactions via digital assistants and other innovative AI solutions.

As AI agents continue to evolve, we can expect to see significant advancements in the way we interact with technology. From customer service chatbots to virtual assistants, AI agents have the potential to revolutionize the way we live and work.