Hugging Face Unveils Open LLM Leaderboard v2: A New Era in AI Evaluation
Hugging Face's updated leaderboard evaluates open LLMs across four key task categories using six benchmarks, providing a more comprehensive picture of their capabilities.
Improved Evaluation Criteria and Benchmarks
The new leaderboard assesses LLMs on four key task categories: knowledge assessment, reasoning over extended contexts, complex mathematics, and instruction following. These categories are covered by six benchmarks: MMLU-Pro, GPQA, MuSR, MATH, IFEval, and BBH. Each benchmark probes a different aspect of an LLM's abilities, from answering graduate-level science questions to solving high-school competition math problems and following explicit instructions.
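The leaderboard's evaluations are built on EleutherAI's lm-evaluation-harness, so a comparable run can be sketched locally. The snippet below is a minimal sketch, not the leaderboard's official pipeline: the model checkpoint and the task identifiers are assumptions and may differ across harness versions.

```python
# Minimal sketch of a local run with EleutherAI's lm-evaluation-harness.
# The model checkpoint and task identifiers below are assumptions; check
# your installed harness version for the exact leaderboard task names.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # Hugging Face transformers backend
    model_args="pretrained=Qwen/Qwen2-72B-Instruct,dtype=bfloat16",
    tasks=[
        "leaderboard_mmlu_pro",   # broad multiple-choice knowledge
        "leaderboard_gpqa",       # graduate-level science questions
        "leaderboard_musr",       # multistep reasoning over long narratives
        "leaderboard_math_hard",  # hardest MATH problems
        "leaderboard_ifeval",     # verifiable instruction following
        "leaderboard_bbh",        # BIG-Bench Hard reasoning tasks
    ],
    batch_size="auto",
)

# Per-task metrics live under the "results" key of the returned dict.
for task, metrics in results["results"].items():
    print(task, metrics)
```

Running the full suite against a 70B-class model is expensive, which is why the official numbers come from Hugging Face's own GPU cluster rather than contributor hardware.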
Top Performers and Notable Absences
Alibaba’s Qwen models have emerged as top contenders, securing the 1st, 3rd, and 10th spots. Meta’s Llama3-70B also appears on the list, alongside several smaller open-source projects that outperform many well-established models. Notably, OpenAI’s ChatGPT is absent from the leaderboard, since Hugging Face focuses solely on open-source models to guarantee reproducibility.
Infrastructure and Evaluation Process
The evaluations run on Hugging Face’s own infrastructure, which uses 300 Nvidia H100 GPUs. Because the platform is open, anyone can submit new models for evaluation, and a community voting system prioritizes the most requested ones. Users can also filter the leaderboard to highlight significant models, preventing an overload of minor entries.
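As an illustration of that filtering workflow, the sketch below shows how the published rankings might be pulled and narrowed down programmatically. The dataset repository id and the column names are assumptions, not the leaderboard's documented schema.

```python
# Hypothetical sketch: pulling the leaderboard's published results table and
# filtering it locally. The dataset id "open-llm-leaderboard/contents" and
# the column names "model_name" / "average_score" are assumptions; consult
# the leaderboard space for the real export location and schema.
from datasets import load_dataset

table = load_dataset("open-llm-leaderboard/contents", split="train").to_pandas()

# Mirror the leaderboard's filtering: drop minor entries and keep the
# strongest models, sorted by their averaged benchmark score.
significant = table[table["average_score"] >= 30.0]
top10 = significant.sort_values("average_score", ascending=False).head(10)
print(top10[["model_name", "average_score"]].to_string(index=False))
```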
Meta’s Performance and Over-Specialization
Meta’s updated Llama models perform worse on the new leaderboard than they did in prior rankings. The decline is attributed to over-specialization on the earlier benchmarks, a form of overfitting that can come at the expense of real-world usefulness. The situation underscores the need for diverse training data and evaluation criteria to sustain robust AI performance.
“The risk of over-specialization is a significant concern in AI development. By incorporating a diverse array of evaluation criteria, we can ensure that models are well-rounded and effective in real-world applications.” - Source
Conclusion
Hugging Face updates the leaderboard weekly, so the rankings reflect the latest performance data and models can be re-evaluated as they improve. Detailed breakdowns of each model’s results on individual benchmarks are also provided, giving insight into specific strengths and weaknesses. The leaderboard’s open-source framework promotes transparency and reproducibility, with all models and their evaluation results available for public scrutiny.