Evaluating the Performance of Long-Context Language Models: A New Frontier
The advent of large language models (LLMs) with very long context windows is reshaping how AI applications are built. These models make it possible to tackle advanced tasks with simple prompting techniques, reducing the need for complex tools and pipelines. However, evaluating the performance of long-context LLMs remains an underexplored area.
In a recent paper, Google DeepMind introduced a benchmark called Long-Context Frontiers (LOFT) to rigorously evaluate the performance of long-context language models (LCLMs). LOFT is designed for tasks with very long prompts, and it provides a way to evaluate and compare LCLMs as context windows expand to millions of tokens.
Long-Context Language Models: A Game-Changer
The limited context window of earlier LLMs required specialized techniques to customize the models for new tasks. For example, if the model couldn't perform a task through few-shot learning, you had to fine-tune it. And if you wanted to add proprietary information to the prompt, you needed a retrieval-augmented generation (RAG) pipeline to choose the specific bits of information from your corpus that were relevant to the task.
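For readers who haven't built one, here is a minimal sketch of the retrieve-then-prompt flow a RAG pipeline implements. The toy word-overlap scorer and the function names are illustrative stand-ins for a real retriever, not anything from the LOFT paper.

```python
# Minimal sketch of the retrieve-then-prompt flow of a RAG pipeline.
# The word-overlap scorer is a toy stand-in for a real dense or sparse retriever.

def score(query: str, passage: str) -> float:
    """Toy relevance score: fraction of query words that appear in the passage."""
    q_words = set(query.lower().split())
    return len(q_words & set(passage.lower().split())) / max(len(q_words), 1)

def retrieve_top_k(query: str, corpus: list[str], k: int = 3) -> list[str]:
    """Keep only the k passages most relevant to the query."""
    return sorted(corpus, key=lambda p: score(query, p), reverse=True)[:k]

def build_rag_prompt(query: str, corpus: list[str]) -> str:
    """Classic RAG: only the retrieved snippets make it into the prompt."""
    context = "\n\n".join(retrieve_top_k(query, corpus))
    return f"Answer using only the passages below.\n\n{context}\n\nQuestion: {query}"
```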
With long-context language models, you can insert your entire corpus or set of training examples into the prompt and let the model learn the task or pick out the parts it needs to solve the problem. You can further improve the model's performance with techniques such as explicit instructions and chain-of-thought reasoning.
Current evaluation methods for LCLMs include the “needle-in-a-haystack” test and fixed-length datasets that haven’t been designed for long-context models. As the researchers write, “Critically, existing evaluations do not adequately stress-test LCLMs on any paradigm-shifting tasks.”
Long-Context Frontiers (LOFT): A Suite of Tasks
Long-Context Frontiers (LOFT) is a suite of six tasks spanning 35 datasets across text, visual, and audio modalities, designed to gauge LCLMs on real-world tasks. LOFT currently supports 32k-, 128k-, and 1M-token context windows, and it also supports automatically constructing longer contexts as LCLMs continue to scale.
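As a rough illustration of how a task can be scaled to larger context windows, the sketch below grows a retrieval corpus to a target token budget by padding gold passages with distractors. The crude token counter and the sampling strategy are assumptions for illustration, not LOFT's actual dataset-construction code.

```python
# Illustrative sketch: grow a corpus until it fills a target context budget.
import random

def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: roughly 4 characters per token."""
    return max(1, len(text) // 4)

def build_corpus(gold_passages: list[str], distractor_pool: list[str],
                 budget_tokens: int, seed: int = 0) -> list[str]:
    """Start from the passages needed to answer the queries, then pad with
    randomly sampled distractors until the token budget (e.g. 128k or 1M) is hit."""
    rng = random.Random(seed)
    corpus = list(gold_passages)
    used = sum(count_tokens(p) for p in corpus)
    pool = distractor_pool[:]
    rng.shuffle(pool)
    for passage in pool:
        cost = count_tokens(passage)
        if used + cost > budget_tokens:
            break
        corpus.append(passage)
        used += cost
    rng.shuffle(corpus)  # mix gold passages in with the distractors
    return corpus
```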
LOFT evaluates several important domains, including retrieval and RAG, SQL, and in-context learning. It aims to open up a new line of research on long-context prompting, which DeepMind introduces as “Corpus-in-Context” (CiC) Prompting.
Image: Corpus-in-Context
CiC combines several prompting strategies to activate the capabilities of LCLMs for learning, retrieving, and reasoning over in-context corpora. A CiC prompt is composed of several parts, including task-specific instructions, the entire knowledge corpus, few-shot learning examples with chain-of-thought reasoning, and the new query.
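A simple way to picture this is the prompt-assembly sketch below. The section headers, passage IDs, and field names are illustrative assumptions; the paper's exact CiC template may differ.

```python
# Sketch of assembling a Corpus-in-Context (CiC) style prompt from its components.

def build_cic_prompt(instructions: str,
                     corpus: dict[str, str],          # passage_id -> passage text
                     few_shot_examples: list[dict],   # keys: "query", "reasoning", "answer"
                     query: str) -> str:
    # 1. Task-specific instructions.
    parts = [instructions, "", "== Corpus =="]
    # 2. The entire corpus, with IDs the model can cite when retrieving.
    for pid, text in corpus.items():
        parts.append(f"[{pid}] {text}")
    # 3. Few-shot examples with chain-of-thought reasoning grounded in the corpus.
    parts.append("\n== Examples ==")
    for ex in few_shot_examples:
        parts.append(f"Query: {ex['query']}")
        parts.append(f"Reasoning: {ex['reasoning']}")
        parts.append(f"Answer: {ex['answer']}")
    # 4. The new query, appended last so the long shared prefix stays unchanged.
    parts.append("\n== Task ==")
    parts.append(f"Query: {query}")
    parts.append("Answer:")
    return "\n".join(parts)
```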
One key advantage of CiC prompting is its compatibility with prefix-caching in autoregressive language models, which means the corpus only needs to be encoded once. For each new request, we only compute the attention values for the new query, and we can reuse the cached attention values of the corpus, few-shot examples, and instructions.
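Here is what that looks like with a small open-source causal LM and a recent version of the Hugging Face transformers library. It is a small-scale sketch of the idea rather than how production LCLM APIs expose caching, and the model choice and prompt strings are placeholders.

```python
# Sketch of prefix caching: encode the shared prefix (instructions + corpus +
# few-shot examples) once, then reuse its key/value cache for every new query.
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # small stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prefix = "Instructions: answer questions using the corpus.\n== Corpus ==\n[doc1] ...\n"
prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids

with torch.no_grad():
    # Encode the long shared prefix once and keep its attention cache.
    prefix_cache = model(prefix_ids, use_cache=True).past_key_values

def answer(query: str, max_new_tokens: int = 20) -> str:
    """Answer a new query while reusing the cached prefix."""
    ids = tokenizer(f"\nQuery: {query}\nAnswer:", return_tensors="pt").input_ids
    cache = copy.deepcopy(prefix_cache)  # keep the shared cache pristine across requests
    generated = []
    with torch.no_grad():
        out = model(ids, past_key_values=cache, use_cache=True)
        for _ in range(max_new_tokens):
            next_id = out.logits[0, -1].argmax().view(1, 1)
            generated.append(next_id.item())
            # Only the single new token is processed; everything before it is cached.
            out = model(next_id, past_key_values=out.past_key_values, use_cache=True)
    return tokenizer.decode(generated)
```

With this setup, each additional request only pays for its own query and answer tokens, which is what makes keeping a very large corpus in the prompt economical across many requests.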
LOFT in Action
DeepMind evaluated Gemini 1.5 Pro (1M context), GPT-4o (128k context), and Claude 3 Opus (200k context) on LOFT, comparing them against fine-tuned models and model pipelines designed for the target task.
The results revealed that LCLMs rival many specialized models. At 128k tokens, they are competitive with Gecko, a leading textual retrieval system. In visual retrieval tasks, Gemini 1.5 Pro outperforms CLIP across all benchmarks and context lengths, and in audio retrieval it is comparable to PaLM 2 DE across five languages. On SQL tasks, LCLMs achieve reasonable performance, though they remain significantly behind specialized pipelines.
The researchers also found that LCLMs lag significantly on complex multi-hop compositional reasoning tasks. Ablation studies further revealed that the models' performance varies with prompting strategies such as chain-of-thought reasoning. There is still much to learn about optimizing LCLMs for tasks with large in-context corpora.
Image: LOFT
As the researchers write, “Our results on LOFT demonstrate that LCLMs can match the performance of many specialized models, while also revealing ample headroom for improvement in robust long-context reasoning as context windows continue to scale. We believe that LOFT provides a fertile testing ground for measuring progress in long-context modeling.”