Benchmark tests are the SATs of the AI world
In 2024, large language models (LLMs) are ubiquitous across industries, from healthcare to finance. But before these models can be deployed, they have to pass a series of benchmark tests to ensure their accuracy and reliability. Michal Shmueli-Scheuer, senior technical staff member for IBM’s Foundation Models Evaluation group, explains why designing these benchmark tests is harder than you might think.
Standardized testing isn’t just for humans anymore. It turns out that large language models (LLMs) have to pass their own version of the SAT (or the GCSE or ATAR, depending on which country you’re in) before they hit prime time. But as the head of IBM’s foundation model evaluation team told us, designing benchmarking tests for artificial intelligence (AI) is harder than you might think.
- captions: RAG technology has the potential to transform the way we interact with language models.
Models are trained to perform a specific function. They are then tested using tasks (a.k.a. benchmarks) relevant to that function and scored on how well they perform. Performance across all the tasks is aggregated into a series of benchmark metrics.
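To make that concrete, here is a minimal sketch of how per-task scores might be rolled up into a single benchmark number. The task names, scores, and unweighted macro-average are illustrative assumptions, not IBM’s actual methodology or real leaderboard data.

```python
# Minimal sketch: rolling per-task scores up into one benchmark metric.
# Task names and scores are illustrative placeholders, not real results.
from statistics import mean

# Accuracy-style scores (0-1) for one model on several benchmark tasks
task_scores = {
    "question_answering": 0.82,
    "summarization": 0.74,
    "code_generation": 0.61,
    "commonsense_reasoning": 0.79,
}

# A simple unweighted macro-average; real leaderboards may weight tasks
# differently or normalize scores against a baseline.
aggregate_score = mean(task_scores.values())
print(f"Aggregate benchmark score: {aggregate_score:.3f}")
```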
But why is evaluation testing so hard? Well, first of all, unlike the SATs, there’s no single broad-spectrum test that serves as the be-all and end-all. Sure, there are different options like Arena, MMLU, BBH, and others. But while these tests largely agree on which model is better when ranking a large number of LLMs (since it’s easy to tell broadly what’s good and what’s bad), they tend to disagree on which model is best when evaluating a smaller number of LLMs.
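One way to quantify that kind of agreement is rank correlation between two leaderboards. The sketch below uses Kendall’s tau from SciPy on made-up scores; the numbers are placeholders, not actual MMLU or Arena results. Across many models spanning a wide quality range the correlation tends to be high, while for a handful of near-equals small score differences flip the ordering and the benchmarks start to disagree.

```python
# Sketch: measuring how strongly two benchmarks agree on a model ranking.
# The scores below are made up for illustration only.
from scipy.stats import kendalltau

benchmark_1 = [0.71, 0.68, 0.64, 0.59, 0.52]  # hypothetical scores for 5 models
benchmark_2 = [0.69, 0.70, 0.61, 0.60, 0.55]  # hypothetical scores for the same models

# Kendall's tau compares the orderings implied by the two score lists:
# 1.0 means identical rankings, 0 means no correlation.
tau, p_value = kendalltau(benchmark_1, benchmark_2)
print(f"Rank agreement (Kendall's tau): {tau:.2f}")
```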
- captions: A highly interpretable visual representation of the LLM’s thought process.
Furthermore, researchers published a paper in Science back in April 2023 warning that aggregate metrics limit our insight into performance in particular situations, making it harder to find system failure points and robustly evaluate system safety.
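A toy illustration of that warning, with hypothetical numbers: two models can share the same aggregate score while behaving very differently on one slice of inputs, and it is exactly that slice-level behaviour, including any failure points, that the average hides.

```python
# Sketch: why aggregate metrics can hide failure points.
# Scores are hypothetical; both models average 0.80 overall.
from statistics import mean

scores_by_category = {
    "model_a": {"general_qa": 0.80, "medical_qa": 0.80, "adversarial": 0.80},
    "model_b": {"general_qa": 0.95, "medical_qa": 0.90, "adversarial": 0.55},
}

for model, categories in scores_by_category.items():
    weakest = min(categories, key=categories.get)
    print(f"{model}: aggregate={mean(categories.values()):.2f}, "
          f"weakest slice={weakest} ({categories[weakest]:.2f})")
```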
To address these issues, Shmueli-Scheuer said IBM strives to design tests that align with four key pillars: representativeness, reliability, efficiency, and validity.
“It’s very hard work,” Shmueli-Scheuer said of evaluation testing.
- captions: DocsBot is an online platform that helps users calculate and compare the costs of using different LLM APIs.
For businesses and developers looking to integrate LLMs, understanding the cost implications is crucial. DocsBot simplifies this process by providing a straightforward platform for comparing costs across leading LLM providers.
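For illustration, the arithmetic behind that kind of comparison is straightforward: tokens in and tokens out, multiplied by each provider’s rates. The sketch below uses hypothetical provider names and per-million-token prices; it is not DocsBot’s implementation, and the figures are not real rates.

```python
# Sketch of the cost arithmetic a comparison tool performs.
# Prices are hypothetical placeholders (USD per million tokens), not real rates.
PRICES_PER_MILLION_TOKENS = {       # (input, output) -- hypothetical
    "provider_a_small_model": (0.50, 1.50),
    "provider_b_large_model": (5.00, 15.00),
}

def monthly_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    in_price, out_price = PRICES_PER_MILLION_TOKENS[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Example workload: 50M input tokens and 10M output tokens per month
for model in PRICES_PER_MILLION_TOKENS:
    print(f"{model}: ${monthly_cost(model, 50_000_000, 10_000_000):,.2f}/month")
```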
Vectara Secures $25 Million Series A Funding
Vectara, the trusted Generative AI product platform, has closed a $25 million Series A round led by FPV Ventures and Race Capital. This funding round, combined with last year’s $28.5 million seed funding round, brings the total funding to $53.5 million, aimed at advancing the state of Retrieval Augmented Generation (RAG) as a Service for regulated industries.
With this funding, Vectara will advance internal innovation, ramp up its go-to-market resources, and expand its offering in Australia and the EMEA region.
Introducing Mockingbird for RAG Technology
Vectara is excited to unveil Mockingbird, a new generative large language model (LLM) fine-tuned specifically for RAG applications. Mockingbird is engineered to reduce hallucinations and improve structured output, providing reliable performance with low latency and cost efficiency.
Combining Mockingbird with Vectara’s Hughes Hallucination Evaluation Model (HHEM) makes it particularly beneficial for regulated industries such as health, legal, finance, and manufacturing, where accuracy, security, and explainability are critical.
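To show where a hallucination check like HHEM sits in such a pipeline, here is a generic, hypothetical sketch of the retrieve-generate-verify pattern. The function names, stub bodies, and threshold are illustrative stand-ins, not Vectara’s API or the real HHEM interface.

```python
# Generic retrieve -> generate -> verify pattern for a RAG application.
# All functions are hypothetical stubs, not Vectara's actual API.
from typing import List

def retrieve(query: str) -> List[str]:
    # Stub: return passages from a document index relevant to the query.
    return ["Passage about the topic...", "Another supporting passage..."]

def generate(query: str, passages: List[str]) -> str:
    # Stub: a RAG-tuned LLM drafts an answer grounded in the passages.
    return "Answer drafted from the retrieved passages."

def hallucination_score(answer: str, passages: List[str]) -> float:
    # Stub: a hallucination-evaluation model scores how well the answer
    # is supported by the passages (1.0 = fully supported).
    return 0.97

def answer_with_guardrail(query: str, threshold: float = 0.9) -> str:
    passages = retrieve(query)
    answer = generate(query, passages)
    if hallucination_score(answer, passages) < threshold:
        return "No well-supported answer was found in the documents."
    return answer

print(answer_with_guardrail("What does the policy say about data retention?"))
```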
- captions: Vectara’s mission is to advance the trustworthiness of Retrieval Augmented Generation.