Active Preference Elicitation: A Breakthrough in Online Alignment of Large Language Models
Large Language Models (LLMs) have advanced rapidly in recent years, in large part because of their improved ability to follow human instructions. Reinforcement Learning from Human Feedback (RLHF) is the dominant technique for aligning LLMs with human intent. It works by optimizing a reward function, which can either be a separate model or be reparameterized within the LLM's policy.
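The reward function mentioned above is typically trained on pairwise preferences with a Bradley-Terry style objective: the model is penalized when it fails to score the preferred response above the rejected one. A minimal single-pair sketch (function name and toy values are illustrative, not from the paper):

```python
import math

def pairwise_reward_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood for one preference pair:
    -log sigmoid(r_chosen - r_rejected)."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss shrinks as the reward model separates the preferred response
# from the rejected one by a larger margin.
loss_good = pairwise_reward_loss(2.0, -1.0)   # large positive margin
loss_bad = pairwise_reward_loss(-1.0, 2.0)    # margin has the wrong sign
```

With a margin of zero the loss is exactly log 2, since the model assigns 50/50 odds to either ordering.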
Illustration of RLHF
This reward function is derived from human preference data over prompt-response pairs. The diversity of responses in that preference data is critical to alignment: it keeps reward models from becoming trapped in local optima and thereby supports more adaptable and capable language models.
Alignment is performed either offline or online. Offline alignment attempts to manually curate a variety of responses for predetermined prompts, but this approach struggles to cover the breadth of natural language. Online alignment, in contrast, uses an iterative procedure: responses are sampled from the LLM, feedback on them produces new preference data, and that data is used to train the reward model.
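The iterative procedure can be sketched as a simple loop. All component functions below are hypothetical stand-ins for the real pieces (an LLM sampler, a human or AI annotator, and a training step), just to make the data flow concrete:

```python
import random

def sample_responses(prompt, n=2):
    # Stand-in: in practice, sample n candidates from the current LLM policy.
    return [f"{prompt}::candidate-{i}-{random.random():.3f}" for i in range(n)]

def collect_preference(a, b):
    # Stand-in: in practice, a human or AI annotator picks the better response.
    return (a, b) if len(a) <= len(b) else (b, a)

def update_policy(preference_data):
    # Stand-in: in practice, update the reward model and LLM on the new pairs.
    return len(preference_data)

def online_alignment_loop(prompts, rounds=3):
    preference_data = []
    for _ in range(rounds):
        for prompt in prompts:
            a, b = sample_responses(prompt)
            chosen, rejected = collect_preference(a, b)
            preference_data.append((prompt, chosen, rejected))
        update_policy(preference_data)  # policy changes, so later samples differ
    return preference_data

data = online_alignment_loop(["Explain RLHF in one line."])
```

The key property, in contrast to offline alignment, is that each round's samples come from the updated policy, so the preference dataset grows with the model rather than staying fixed.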
“Sampling is random in this approach, so out-of-distribution (OOD) regions can be explored.”
In most online RLHF setups, however, the LLM's only objective is to maximize the expected reward on the data gathered so far. This passive exploration tends to produce responses clustered around local optima, risking overfitting and premature convergence while high-reward regions remain unexplored.
Preference optimization has proven highly effective at aligning LLMs with human goals, especially when combined with RLHF. Collecting online feedback on model outputs, from humans or from AI, in an iterative loop typically yields stronger reward models and better-aligned LLMs than offline alignment, which depends on a fixed dataset.
Illustration of online alignment
Developing a globally accurate reward model, however, requires systematic exploration to elicit diverse responses across the vast space of natural language, a requirement that random sampling from an ordinary reward-maximizing LLM cannot satisfy.
To address this, the authors propose a bilevel objective that is optimistically biased toward potentially high-reward responses, so the model actively explores out-of-distribution (OOD) regions. The resulting method, called Self-Exploring Language Models (SELM), solves the inner-level problem with a reparameterized reward function, eliminating the need for a separate reward model and updating the LLM iteratively with a simple objective.
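With the reward reparameterized inside the policy (DPO-style, as beta times the log-ratio of the policy to a reference model), the "optimistic bias" amounts to augmenting the preference loss with a bonus for responses the implicit reward scores highly. The sketch below is an illustration of that idea only; the exact SELM objective and the value of alpha here are assumptions, not the paper's equation:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one pair: -log sigmoid(beta * margin), where the
    implicit reward of a response is beta * log(pi(y|x) / pi_ref(y|x))."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(sigmoid(beta * margin))

def optimistic_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                    beta=0.1, alpha=0.01):
    """Illustrative optimism-augmented objective (form and alpha are
    assumptions): the DPO loss minus a bonus proportional to the chosen
    response's implicit reward, biasing updates toward high-reward regions."""
    optimism_bonus = logp_w - ref_logp_w  # log-ratio for the chosen response
    return (dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta)
            - alpha * optimism_bonus)
```

Setting alpha to zero recovers the plain DPO loss, which is why SELM can reduce to a single simple objective with no separate reward model.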
Illustration of SELM
Compared with Direct Preference Optimization (DPO), SELM aims to improve exploration efficiency and to mitigate the indiscriminate favoring of unseen extrapolations. Experimentally, when fine-tuned on the Zephyr-7B-SFT and Llama-3-8B-Instruct models, SELM substantially boosts performance on instruction-following benchmarks such as MT-Bench and AlpacaEval 2.0, and it also performs well on a range of standard academic benchmarks across diverse settings.
In conclusion, by ensuring that LLMs not only follow instructions precisely but also explore a broad range of candidate responses, this approach marks a substantial step toward aligning LLMs with human intent, ultimately yielding more capable and reliable language models.