Hugging Face Unveils Open LLM Leaderboard v2: A New Era in AI Evaluation
Hugging Face's updated leaderboard evaluates open LLMs across four key task categories using six benchmarks, providing a more comprehensive picture of their capabilities.
Improved Evaluation Criteria and Benchmarks
The new leaderboard assesses LLMs on four key task categories: knowledge assessment, reasoning over extended contexts, complex mathematics, and instruction following. These categories are covered by six benchmarks: MMLU-Pro, GPQA, MuSR, MATH, IFEval, and BBH. Each benchmark probes a different aspect of an LLM's abilities, from answering graduate-level science questions to solving high-school competition math problems and following explicit instructions.
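The leaderboard's evaluations are built on EleutherAI's lm-evaluation-harness, so a comparable run can be sketched locally. The snippet below is a minimal sketch, not the leaderboard's official pipeline: the model checkpoint and the task identifiers are assumptions and may differ across harness versions.

```python
# Minimal sketch of a local run with EleutherAI's lm-evaluation-harness.
# The model checkpoint and task identifiers below are assumptions; check
# your installed harness version for the exact leaderboard task names.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # Hugging Face transformers backend
    model_args="pretrained=Qwen/Qwen2-72B-Instruct,dtype=bfloat16",
    tasks=[
        "leaderboard_mmlu_pro",   # broad multiple-choice knowledge
        "leaderboard_gpqa",       # graduate-level science questions
        "leaderboard_musr",       # multistep reasoning over long narratives
        "leaderboard_math_hard",  # hardest MATH problems
        "leaderboard_ifeval",     # verifiable instruction following
        "leaderboard_bbh",        # BIG-Bench Hard reasoning tasks
    ],
    batch_size="auto",
)

# Per-task metrics live under the "results" key of the returned dict.
for task, metrics in results["results"].items():
    print(task, metrics)
```

Running the full suite against a 70B-class model is expensive, which is why the official numbers come from Hugging Face's own GPU cluster rather than contributor hardware.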
Top Performers and Notable Absences
Alibaba’s Qwen models have emerged as top contenders, securing the 1st, 3rd, and 10th spots. Meta’s Llama3-70B also appears on the list, alongside several smaller open-source projects that outperform many well-established models. Notably, OpenAI’s ChatGPT is absent from the leaderboard, since Hugging Face focuses solely on open-source models to guarantee reproducibility.
Infrastructure and Evaluation Process
The evaluations run on Hugging Face’s own infrastructure, which uses 300 Nvidia H100 GPUs. Because the platform is open, anyone can submit new models for evaluation, and a community voting system prioritizes the most requested ones. Users can also filter the leaderboard to highlight significant models, preventing an overload of minor entries.
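As an illustration of that filtering workflow, the sketch below shows how the published rankings might be pulled and narrowed down programmatically. The dataset repository id and the column names are assumptions, not the leaderboard's documented schema.

```python
# Hypothetical sketch: pulling the leaderboard's published results table and
# filtering it locally. The dataset id "open-llm-leaderboard/contents" and
# the column names "model_name" / "average_score" are assumptions; consult
# the leaderboard space for the real export location and schema.
from datasets import load_dataset

table = load_dataset("open-llm-leaderboard/contents", split="train").to_pandas()

# Mirror the leaderboard's filtering: drop minor entries and keep the
# strongest models, sorted by their averaged benchmark score.
significant = table[table["average_score"] >= 30.0]
top10 = significant.sort_values("average_score", ascending=False).head(10)
print(top10[["model_name", "average_score"]].to_string(index=False))
```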
Meta’s Performance and Over-Specialization
Meta’s updated Llama models perform worse on the new leaderboard than they did in prior rankings. The decline is attributed to over-specialization on the earlier benchmarks, a form of overfitting that can come at the expense of real-world usefulness. The situation underscores the need for diverse training data and evaluation criteria to sustain robust AI performance.
“The risk of over-specialization is a significant concern in AI development. By incorporating a diverse array of evaluation criteria, we can ensure that models are well-rounded and effective in real-world applications.” - Source
Conclusion
Hugging Face updates the leaderboard weekly, so the rankings reflect the latest performance data and models can be re-evaluated as they improve. Detailed breakdowns of each model’s results on individual benchmarks are also provided, giving insight into specific strengths and weaknesses. The leaderboard’s open-source framework promotes transparency and reproducibility, with all models and their evaluation results available for public scrutiny.