Democratizing AI: The Rise of Transparent Language Models
The AI research community has witnessed remarkable advancements in language models, with proprietary models like GPT, Gemini, and Claude achieving state-of-the-art performance. However, these closed-source models lack transparency in their training data and methods, hindering scientific progress and democratization of AI development.
Open-weight models like LLaMA-3 release their parameters, but withhold the training data and data-processing methods behind them. Efforts to build fully transparent language models, such as Pythia, Amber, and OLMo, aim to advance scientific research by sharing more of the pipeline, including pre-training data and training code. Despite these efforts, fully transparent models still lag behind the state of the art on reasoning, knowledge, and coding tasks.
MAP-Neo: A Fully Open-Source and Transparent Bilingual LLM Suite
Researchers from M-A-P, University of Waterloo, Wuhan AI Research, and 01.AI have released MAP-Neo, a highly capable and transparent bilingual language model with 7 billion parameters, trained on 4.5 trillion high-quality tokens. This model, fully open-sourced, matches the performance of leading closed-source language models. The release includes the cleaned pre-training corpus, data cleaning pipeline, checkpoints, and an optimized training and evaluation framework.
“The advancement of open-source language models is crucial for AI research and applications. Recent efforts focus on enhancing both performance and transparency.”
MAP-Neo-7B stands out by releasing intermediate checkpoints, a comprehensive data cleaning pipeline, an accessible pre-training corpus, and reproduction code together, a combination not fully matched by Mistral, LLaMA-3, Pythia, Amber, or OLMo. MAP-Neo-7B performs strongly on benchmarks for Chinese and English understanding (C-EVAL, MMLU), mathematical reasoning (GSM8K), and coding (HumanEval). By pairing competitive scores with this level of openness, it sets a new standard for transparency and performance, promoting trustworthiness and collaboration in the research community.
MAP-Neo’s Tokenizer
The tokenizer is trained with byte-pair encoding (BPE) via SentencePiece on 50 billion samples, each capped at a length of 64,000, with priority given to code, math, and academic data. The vocabulary size is 64,000, and the maximum sentence-piece length is 16 to enhance Chinese performance. Numbers are split into individual digits, and unknown UTF-8 characters fall back to byte-level encoding. No normalization or dummy prefixes are applied, and character coverage is kept at 99.99%. Extra-whitespace removal is disabled: it was enabled in early training runs but was found to break code formatting and hurt performance, so the team turned it off.
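The settings above map directly onto SentencePiece training flags. The following is an illustrative sketch reconstructed from the reported configuration, not the authors' actual invocation; the input path and model prefix are placeholders.

```shell
# Illustrative spm_train invocation based on the settings described above.
# corpus.txt and neo_tokenizer are placeholder names, not from the MAP-Neo release.
spm_train \
  --input=corpus.txt \
  --model_prefix=neo_tokenizer \
  --model_type=bpe \
  --vocab_size=64000 \
  --max_sentencepiece_length=16 \
  --character_coverage=0.9999 \
  --split_digits=true \
  --byte_fallback=true \
  --normalization_rule_name=identity \
  --add_dummy_prefix=false \
  --remove_extra_whitespaces=false
```

Here `--split_digits` tokenizes numbers digit by digit, `--byte_fallback` lets unknown UTF-8 characters degrade to byte-level tokens rather than an unknown token, and `--normalization_rule_name=identity` together with `--add_dummy_prefix=false` and `--remove_extra_whitespaces=false` disables normalization and whitespace rewriting, which helps preserve code formatting.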
MAP-Neo’s Performance
The MAP-Neo model family performs strongly across benchmarks for both base and chat models, particularly in code, math, and instruction-following tasks. Among fully transparent language models, MAP-Neo leads on standard benchmarks, and the high-quality pre-training data behind the base model contributes to its results on complex reasoning tasks. Compared to prior transparent models such as Pythia, Amber, and OLMo, it represents a significant advance in both capability and openness.
“Data colonialism is a concern as firms exploit algorithms, leading to the manipulation of human behavior and market dominance. The concentration of AI capabilities in large tech firms and elite universities highlights the need for democratizing AI access to counter data colonialism.”
In conclusion, MAP-Neo addresses these issues as a fully open-source bilingual language model that documents every key stage of its development. This transparency can reduce deployment costs, particularly for Chinese-language models, promote inclusivity in innovation, and mitigate the dominance of English-language models.