LiveBench: A New Standard for Evaluating Large Language Models

LiveBench, a new benchmark for evaluating large language models, offers contamination-free test data and objective scoring, addressing the limitations of traditional machine learning benchmark frameworks.

The development of large language models (LLMs) has led to a surge in innovation, but it has also created a challenge in evaluating their performance. Traditional machine learning benchmark frameworks are no longer sufficient to evaluate new models, and existing LLM benchmarks have serious limitations. To address this, a team of researchers from Nvidia, Abacus.ai, New York University, the University of Maryland, and the University of Southern California has developed LiveBench, a new benchmark that offers contamination-free test data and objective scoring.

LiveBench: A Game-Changer in LLM Evaluation

LiveBench is a general-purpose LLM benchmark that utilizes frequently updated questions from recent sources, scoring answers automatically according to objective ground-truth values. It contains a wide variety of challenging tasks, spanning math, coding, reasoning, language, instruction following, and data analysis. The release of LiveBench is especially notable because one of its contributors is Yann LeCun, a pioneer in the world of AI and Meta’s chief AI scientist.

Existing LLM benchmarks have serious limitations. They are typically published on the internet, and most modern LLMs include large swaths of the internet in their training data. If the LLM has seen the questions of a benchmark during training, its performance on that benchmark will be artificially inflated, making many LLM benchmarks unreliable.

“Like many in the community, we knew that we needed better LLM benchmarks because existing ones don’t align with our qualitative experience using LLMs,” says Micah Goldblum, one of LiveBench’s co-creators, in an email to VentureBeat.

LiveBench: What You Need to Know

LiveBench releases new questions every month to minimize potential test data contamination. The questions are drawn from recently released datasets, math competitions, arXiv papers, news articles, and IMDb movie synopses. Because each question has a verifiable, objective ground-truth answer, it can be scored accurately and automatically without relying on LLM judges.
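The scoring idea itself is simple in principle. As a minimal sketch (hypothetical Python, not LiveBench’s actual code; the function and field names here are invented for illustration), objective ground-truth scoring amounts to normalizing a model’s answer and comparing it with the known correct value:

# Minimal illustration of objective, ground-truth scoring.
# NOT LiveBench's implementation; names and fields are hypothetical.
def normalize(answer: str) -> str:
    # Ignore case and surrounding whitespace so formatting differences don't count as errors.
    return " ".join(answer.lower().split())

def score_answer(model_answer: str, ground_truth: str) -> float:
    # Exact match against the verifiable ground-truth answer: 1.0 if correct, else 0.0.
    return 1.0 if normalize(model_answer) == normalize(ground_truth) else 0.0

# A toy question with a single objective answer.
question = {"prompt": "What is 17 * 24?", "ground_truth": "408"}
print(score_answer(" 408 ", question["ground_truth"]))  # prints 1.0

Because the correct answer is known when the question is written, no LLM judge is involved, which is what keeps the scoring objective and cheap to automate.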

Tasks and Categories

An initial set of 18 tasks across six categories is available today. They’re tasks that use “a continuously updated information source for their questions” or are “more challenging or diverse versions of existing benchmark tasks.” Here’s the breakdown of tasks by category:

  • Math: questions from high school math competitions from the past 12 months, as well as harder versions of AMPS questions
  • Coding: code generation and a novel code completion task
  • Reasoning: challenging versions of Big-Bench Hard’s Web of Lies, positional reasoning from bAbI, and Zebra Puzzles
  • Language Comprehension: three tasks featuring Connections word puzzles, a typo removal task, and a movie synopsis unscrambling task from recent movies featured on IMDb and Wikipedia
  • Instruction Following: four tasks to paraphrase, simplify, summarize, or generate stories about recent articles from The Guardian while adhering to requirements such as word limits or incorporating specific elements in the response
  • Data Analysis: three tasks that use recent datasets from Kaggle and Socrata, namely table reformatting, predicting which columns can be used to join two tables, and predicting the correct type annotation of a data column (a brief sketch of this last task follows the list)
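To make that last category concrete, here is a small hypothetical sketch (not LiveBench’s actual data format or code; the record layout and names are invented) of how a column type-annotation task with an objective ground truth could be posed and scored:

# Hypothetical example of a column type-annotation task; not LiveBench's actual format.
task = {
    "column_name": "signup_date",
    "sample_values": ["2024-05-01", "2024-05-03", "2024-05-09"],
    "choices": ["integer", "float", "date", "text"],
    "ground_truth": "date",
}

def score_type_annotation(model_choice: str, ground_truth: str) -> float:
    # The model's prediction either matches the known column type or it does not.
    return 1.0 if model_choice.strip().lower() == ground_truth else 0.0

print(score_type_annotation("Date", task["ground_truth"]))  # prints 1.0

The join-prediction and table-reformatting tasks follow the same basic pattern: the correct output is known when the question is generated, so grading reduces to a direct comparison.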

Tasks range in difficulty from easy to very challenging, with the aim that top models achieve success rates between 30 and 70 percent.

Caption: The benchmark’s creators say they have evaluated many “prominent closed-source models, as well as dozens of open-source models” between 500 million and 110 billion parameters in size.

What It Means for the Enterprise

Business leaders already have a tough time working out how to use AI and developing a sound strategy around the technology. Asking them to also pick the right LLM adds unnecessary stress to the equation. Benchmarks can offer some reassurance that a model performs well, much like product reviews, but do they give executives the complete picture of what’s under the hood?

“Navigating all the different LLMs out there is a big challenge, and there’s unwritten knowledge regarding what benchmark numbers are misleading due to contamination, which LLM-judge evals are super biased, etc.,” Goldblum states. “LiveBench makes comparing models easy because you don’t have to worry about these problems. Different LLM use-cases will demand new tasks, and we see LiveBench as a framework that should inform how other scientists build out their own evals down the line.”

Comparing LiveBench to Other Benchmarks

Declaring you have a better evaluation standard is one thing, but how does LiveBench compare to benchmarks the AI industry has relied on for some time? The team investigated this, checking how LiveBench’s scores lined up with prominent LLM benchmarks, namely LMSYS’s Chatbot Arena and Arena-Hard. LiveBench showed “generally similar” trends to its industry peers, though some models were “noticeably stronger on one benchmark versus the other, potentially indicating some downsides of LLM judging.”