Chinese AI Models Take the Lead in Hugging Face’s LLM Chatbot Benchmark Leaderboard
As I delved into the world of large language models (LLMs), I couldn’t help but notice a significant shift in the landscape. Hugging Face, a pioneer in the field, has released its second LLM leaderboard, and the results are nothing short of astonishing. Chinese AI models, particularly Alibaba’s Qwen models, have dominated the top spots, leaving major US competitors in the dust.
Chinese AI models take the lead
The new leaderboard is designed to be a more challenging and uniform standard for testing open LLM performance across a variety of tasks. The tests cover knowledge, reasoning over extremely long contexts, complex math, and instruction following. Six benchmarks are used to evaluate these qualities, with tests ranging from solving 1,000-word murder mysteries to explaining PhD-level questions in layman’s terms.
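For readers who want to try one of these open models themselves, here is a minimal sketch using the Hugging Face transformers library. It is not the leaderboard’s own evaluation harness, and the Qwen2-72B-Instruct model ID is simply one example of an open chat model hosted on the Hub; any other open model can be swapped in.

```python
# Minimal, illustrative sketch of prompting an open model from the Hugging Face Hub.
# This is NOT the leaderboard's evaluation harness; the model ID below is only an example.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2-72B-Instruct"  # example open model; substitute any open chat model

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# A single instruction-following-style prompt, loosely in the spirit of the
# leaderboard's instruction-following test.
messages = [
    {"role": "user", "content": "Explain a PhD-level physics concept in layman's terms."}
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

Running a 72B-parameter model this way requires substantial GPU memory; smaller open models from the same leaderboard can be used as drop-in replacements for experimentation.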
“Qwen 72B is the king and Chinese open models are dominating overall.” - Hugging Face
The frontrunner on the new leaderboard is Qwen, Alibaba’s LLM, whose variants take 1st, 3rd, and 10th place. Other notable models that made the cut include Meta’s Llama3-70B and a handful of smaller open-source projects that managed to outperform the rest of the pack.
Hugging Face’s leaderboard
What’s striking is the absence of ChatGPT, a closed-source model that couldn’t be tested due to reproducibility concerns. This highlights the importance of open-source collaboration in the LLM space.
As I explored the leaderboard, I noticed a trend of over-training LLMs on the first leaderboard’s benchmarks alone, which caused their real-world performance to regress. The phenomenon is reminiscent of Google’s AI-generated answers, which have shown that an LLM’s performance is only as good as its training data.
LLM performance is only as good as its training data
The creation of a second leaderboard is a response to the growing concern that LLMs are becoming too specialized and losing their real-world applicability. By introducing new and more challenging tests, Hugging Face aims to encourage developers to create more versatile and effective models.
As I reflect on the implications of this shift, I’m reminded of the importance of collaboration and open-source innovation in the LLM space. The future of artificial intelligence depends on our ability to create models that are not only powerful but also adaptable and reliable.