AI Researchers Develop LiveBench, a New Benchmark for Evaluating Language Models

Researchers have developed a new benchmark, LiveBench, to evaluate language models' question-answering capabilities, addressing contamination and accuracy issues in existing benchmarks.

A group of researchers has developed a new benchmark, dubbed LiveBench, to ease the task of evaluating large language models’ question-answering capabilities. The researchers released the benchmark on Wednesday under an open-source license. The project was sponsored by Abacus.AI Inc., a venture-backed artificial intelligence startup, and included the participation of Turing Award-winning computer scientist Yann LeCun.

“One weakness is that some types of questions do not have ground-truth answers, such as ‘write a travel guide to Hawaii.’” - LiveBench creators

LiveBench is designed to address two challenges that the researchers have identified in existing LLM evaluation benchmarks. The first is a phenomenon known as contamination. The other is that software teams often evaluate LLMs’ question-answering prowess using another LLM, which can lead to accuracy issues.

Language models are often trained on large amounts of publicly available web content. In many cases, that content includes answers to questions from popular AI evaluation benchmarks. If an LLM has the answers to a benchmark, it can “cheat” during evaluations, which means the benchmark results won’t accurately reflect its capabilities. This phenomenon is known as contamination in the machine learning ecosystem.

According to LiveBench’s creators, the newly released benchmark avoids contamination during LLM evaluations. It does so by giving neural networks tasks whose answers are unlikely to appear in their training datasets. For good measure, the researchers will regularly refresh LiveBench’s task collection, since LLMs might eventually obtain answers to the current questions.

During AI accuracy evaluations, language models’ answers to the questions in a benchmark often aren’t scored manually. Instead, researchers use an external LLM such as GPT-4 to check the responses. LiveBench’s creators argue that this approach has limitations because LLMs often make mistakes while evaluating other neural networks’ benchmark responses.
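
To make that contrast concrete, the sketch below compares deterministic grading against a known ground-truth answer with delegating the grading to a judge model. The function names, the normalization step and the `judge` callable are illustrative assumptions, not LiveBench’s actual implementation.

```python
# Illustrative sketch only -- not LiveBench's code. It contrasts the two
# scoring styles discussed above: checking against a known ground-truth
# answer versus asking another LLM to act as the judge.

def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(text.lower().split())

def score_against_ground_truth(model_answer: str, ground_truth: str) -> float:
    """Deterministic grading: reproducible, with no judge model to make mistakes."""
    return 1.0 if _normalize(model_answer) == _normalize(ground_truth) else 0.0

def score_with_llm_judge(question: str, model_answer: str, judge) -> float:
    """LLM-as-judge grading: `judge` is any callable that maps a prompt to text.
    The grade inherits whatever mistakes the judge model makes."""
    prompt = (
        f"Question: {question}\n"
        f"Candidate answer: {model_answer}\n"
        "Reply with 1 if the answer is correct, otherwise reply with 0."
    )
    return 1.0 if judge(prompt).strip().startswith("1") else 0.0
```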

The current version of LiveBench includes 960 questions across six categories: reasoning, data analysis, math, coding, language comprehension, and instruction following. Some of the questions are more challenging versions of test content from existing AI benchmarks. LiveBench’s other tasks change regularly based on information added to frequently updated public data sources such as arXiv, a popular repository of research papers.
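
As a rough illustration of the “frequently updated public data sources” idea, the snippet below pulls the most recently submitted papers in an arbitrary arXiv category using the third-party `arxiv` Python package. It is only a sketch of where fresh source material could come from, not LiveBench’s actual question-generation pipeline.

```python
# Sketch of fetching fresh source material from arXiv via the third-party
# `arxiv` package (pip install arxiv). The query and category are arbitrary
# examples; this is not how LiveBench itself generates questions.
import arxiv

client = arxiv.Client()
search = arxiv.Search(
    query="cat:cs.CL",                          # recent NLP submissions
    max_results=5,
    sort_by=arxiv.SortCriterion.SubmittedDate,  # newest first
)

for paper in client.results(search):
    # A paper published after a model's training cutoff is material the
    # model is unlikely to have seen during training.
    print(paper.published.date(), paper.title)
```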

An AI benchmark is a collection of questions used to test neural networks’ knowledge of a given topic. Some benchmarks also contain other types of tasks, such as prompts instructing an LLM to debug a code file. By checking how many of the tasks the LLM performs correctly, researchers can gain a better understanding of its capabilities and limitations.
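
As a simple illustration of that scoring step, the snippet below aggregates hypothetical per-task pass/fail results into a per-category accuracy figure. The data layout is an assumption made for this example, not LiveBench’s format.

```python
# Illustrative only: turn per-task pass/fail results into per-category
# accuracy. The (category, passed) tuple layout is assumed for this example.
from collections import defaultdict

def accuracy_by_category(results):
    """results: iterable of (category, passed) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        correct[category] += int(passed)
    return {cat: correct[cat] / totals[cat] for cat in totals}

example = [("reasoning", True), ("reasoning", False), ("coding", True)]
print(accuracy_by_category(example))  # {'reasoning': 0.5, 'coding': 1.0}
```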