Rethinking QA Dataset Design: How Popular Knowledge Enhances LLM Accuracy
Large language models (LLMs) have gained significant attention for their ability to store vast amounts of factual knowledge in their weights during pretraining. This capability has led to promising results in knowledge-intensive tasks, particularly factual question answering. However, a critical challenge persists: LLMs often generate plausible but incorrect responses to queries, undermining their reliability. This inconsistency in factual accuracy is a significant hurdle to the widespread adoption of, and trust in, LLMs for knowledge-based applications.
“The problem is further complicated by the observation that even when LLMs possess the correct information, they may still produce inaccurate answers, suggesting underlying issues in knowledge retrieval and application.”
Researchers have attempted various approaches to improve factuality in LLMs. Some studies focus on the impact of unfamiliar examples during fine-tuning, revealing that these can potentially worsen factuality due to overfitting. Other approaches examine the reliability of factual knowledge, showing LLMs often underperform on obscure information. Techniques to enhance factuality include manipulating attention mechanisms, using unsupervised internal probes, and developing methods for LLMs to abstain from answering uncertain questions.
LLMs struggle to retrieve accurate knowledge
Despite these efforts, challenges in achieving consistent factual accuracy persist. In a recent study, researchers from the Department of Machine Learning at Carnegie Mellon University and the Department of Computer Science at Stanford University found that the impact of fine-tuning examples on LLMs depends critically on how well the facts are encoded in the pre-trained model. Fine-tuning on well-encoded facts significantly improves factuality, while using less well-encoded facts can harm performance.
“This phenomenon occurs because LLMs can either use memorized knowledge or rely on general shortcuts to answer questions. The composition of fine-tuning data determines which mechanism is amplified.”
The study uses a synthetic setup to examine how fine-tuning data affects LLM factuality. The setup simulates a simplified token space for subjects, relations, and answers, with different formatting for pretraining and downstream tasks. In the pretraining samples, subjects are drawn from a Zipf distribution, so a few subjects appear very often and most appear rarely, while relations are drawn from a uniform distribution.
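The sampling step of such a setup can be sketched in a few lines of Python. This is a minimal illustration, not the study's code: the function names, the Zipf exponent of 1.0, and the vocabulary sizes are all assumptions chosen for readability, and the answer tokens and formatting differences are left out.

```python
import random
from collections import Counter


def zipf_weights(n, exponent=1.0):
    """Unnormalized Zipf weights: rank r gets weight 1 / r**exponent."""
    return [1.0 / (rank ** exponent) for rank in range(1, n + 1)]


def sample_pretraining_facts(num_subjects, num_relations, num_samples,
                             exponent=1.0, seed=0):
    """Draw (subject, relation) pairs: Zipf over subjects, uniform over relations.

    All parameter values here are hypothetical, not taken from the study.
    """
    rng = random.Random(seed)
    weights = zipf_weights(num_subjects, exponent)
    # Subject 0 is the most popular, subject num_subjects - 1 the least.
    subjects = rng.choices(range(num_subjects), weights=weights, k=num_samples)
    relations = [rng.randrange(num_relations) for _ in range(num_samples)]
    return list(zip(subjects, relations))


facts = sample_pretraining_facts(num_subjects=100, num_relations=5,
                                 num_samples=10_000)
subject_counts = Counter(subject for subject, _ in facts)
print("most popular subject seen:", subject_counts[0], "times")
print("least popular subject seen:", subject_counts[99], "times")
```

Counting how often each subject appears gives a simple stand-in for "popularity": under these assumed settings, the head subjects are seen far more often than the tail subjects, which is the imbalance the synthetic setup is meant to reproduce.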
Fine-tuning data composition affects LLM factuality
Experimental results across multiple datasets (PopQA, Entity-Questions, and MMLU) and models (Llama-7B and Mistral) consistently show that fine-tuning on less popular or lower-confidence examples yields worse factuality than fine-tuning on popular knowledge. The performance gap widens for less popular test points, supporting the hypothesis that less popular facts are more sensitive to fine-tuning choices.
“Surprisingly, even randomly selected subsets outperform fine-tuning on the least popular knowledge, suggesting that including some popular facts can mitigate the negative impact of less popular ones.”
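The three kinds of fine-tuning subsets compared here (most popular, least popular, and random) can be sketched as a simple selection function. This is an illustrative sketch under stated assumptions: the `popularity` field, the `select_finetuning_subset` helper, and the toy examples are hypothetical, and the article does not specify how the study scores popularity.

```python
import random


def select_finetuning_subset(examples, k, strategy, seed=0):
    """Pick k fine-tuning examples by an assumed numeric 'popularity' score.

    strategy is one of "most_popular", "least_popular", or "random".
    """
    if strategy == "random":
        return random.Random(seed).sample(examples, k)
    if strategy not in ("most_popular", "least_popular"):
        raise ValueError(f"unknown strategy: {strategy}")
    ranked = sorted(
        examples,
        key=lambda ex: ex["popularity"],
        reverse=(strategy == "most_popular"),
    )
    return ranked[:k]


# Toy QA pool: popularity grows with the index (hypothetical scores).
pool = [{"question": f"q{i}", "popularity": i} for i in range(10)]

most_popular = select_finetuning_subset(pool, 3, "most_popular")
least_popular = select_finetuning_subset(pool, 3, "least_popular")
random_subset = select_finetuning_subset(pool, 3, "random")

print([ex["question"] for ex in most_popular])
print([ex["question"] for ex in least_popular])
```

A random subset will usually contain a mix of popular and unpopular facts, which is consistent with the observation that it can outperform a subset made up only of the least popular ones.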
The study provides significant insights into improving language model factuality through strategic QA dataset composition. Contrary to intuitive assumptions, fine-tuning on well-known facts consistently enhances overall factuality. This finding, observed across various settings and supported by a conceptual model, challenges conventional approaches to QA dataset design.
Strategic QA dataset composition improves LLM factuality
The research opens new avenues for improving language model performance, suggesting potential benefits in regularization techniques to overcome attention imbalance, curriculum learning strategies, and the development of synthetic data for efficient knowledge extraction. These findings provide a foundation for future work aimed at enhancing the factual accuracy and reliability of language models in diverse applications.