Unveiling the Power of FrankenMoEs: A Deep Dive into MergeKit

Uncover the intricacies of creating 'frankenMoEs' with MergeKit, revolutionizing AI model development. Explore the architecture of MoEs, the challenges they pose, and the innovative methodologies behind initializing routers. Dive into the world of expert selection and YAML configurations for crafting optimal 'frankenMoEs'.

In the realm of AI architecture, the Mixture of Experts (MoE) has emerged as a frontrunner, offering enhanced performance and more efficient inference, albeit at the cost of increased VRAM usage. While traditional MoEs are typically built from the ground up, a novel approach presented by MergeKit introduces the concept of ‘frankenMoEs,’ changing the creation process by combining multiple pre-trained models into a single MoE.

Delving into MoEs

MoEs represent a cutting-edge architectural design aimed at optimizing efficiency and efficacy. By incorporating multiple specialized subnetworks, known as ‘experts,’ an MoE activates only the experts relevant to a given input, leading to faster training and more efficient inference. The core components of an MoE model are the Sparse MoE Layers and the Gate Network, or Router.

A visual representation of MoE architecture

The Role of Sparse MoE Layers

Sparse MoE Layers serve as a replacement for the conventional dense feed-forward network layers within the transformer architecture. Each MoE layer houses several experts, with only a subset engaged per input token, so far fewer parameters are involved in each forward pass and inference is cheaper than in a dense model of the same total size.

Understanding the Gate Network or Router

The Gate Network or Router plays a pivotal role in dictating which tokens are processed by specific experts, ensuring optimal handling of input segments.
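
To make the interplay between the experts and the router concrete, below is a minimal PyTorch sketch of a top-2 sparse MoE block. It is illustrative only; the sizes, module names, and routing loop are simplified assumptions rather than MergeKit or Mixtral internals. The router scores every expert for each token, and only the two highest-scoring experts actually run.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, hidden_size=128, ffn_size=512, num_experts=4, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden_size, num_experts, bias=False)  # gate network
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(hidden_size, ffn_size), nn.SiLU(),
                          nn.Linear(ffn_size, hidden_size))
            for _ in range(num_experts)
        ])

    def forward(self, x):                              # x: (tokens, hidden_size)
        scores = self.router(x)                        # (tokens, num_experts)
        weights, chosen = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)           # normalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):                 # only the selected experts run
            for e, expert in enumerate(self.experts):
                mask = chosen[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

moe = SparseMoE()
print(moe(torch.randn(10, 128)).shape)  # torch.Size([10, 128])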

Despite their advantages, MoEs present unique challenges, particularly around fine-tuning and memory demands. Fine-tuning is tricky because expert utilization must be kept balanced during training for the gating weights to learn properly. Moreover, although only a fraction of the total parameters is active for any given token, the entire model, including all experts, must reside in memory, demanding substantial VRAM capacity.
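
As a quick illustration of the memory issue, here is a back-of-the-envelope calculation using Mixtral-8x7B's published parameter counts (roughly 46.7B total, about 12.9B active per token). The figures are approximate and only meant to show the gap between what must sit in VRAM and what is actually used per token.

total_params = 46.7e9    # all eight experts plus the shared layers must be loaded
active_params = 12.9e9   # two experts plus the shared layers are used per token
bytes_per_param = 2      # float16 weights

print(f"weights held in VRAM  : {total_params * bytes_per_param / 1e9:.0f} GB")  # ~93 GB
print(f"fraction used per token: {active_params / total_params:.0%}")            # ~28%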

Distinguishing True MoEs from FrankenMoEs

A fundamental disparity between true MoEs and frankenMoEs lies in their training methodologies. True MoEs train the experts and the router jointly, whereas frankenMoEs upcycle existing models and only initialize the router afterwards. In a frankenMoE, the experts share every non-FFN parameter (the attention and layer-norm weights copied from the base model), so the total parameter count is lower than the sum of the individual experts, and each token's compute involves only the FFNs of the experts selected for it.

The Art of Initializing Routers

MergeKit offers three distinct router initialization methods: Random, Cheap embed, and Hidden. Each method caters to varying computational requirements and hardware capabilities.

Random Initialization

The router weights are drawn at random. This option should be used with caution, as the same experts may end up being selected repeatedly; further fine-tuning or an explicit routing configuration may be needed to compensate.

Cheap Embed Strategy

This method uses the raw embeddings of the input tokens directly and applies the same transformation across all layers. It is computationally cheap and well suited to less powerful hardware.

Hidden Representation Technique

This method extracts the hidden representations of a list of positive prompts from the last layer of the LLM, then averages and normalizes them to initialize the gates. For detailed insights, refer to Charles Goddard’s blog.
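
The idea can be sketched in a few lines of Python. The snippet below is a simplified illustration of the principle, not MergeKit's actual implementation: for one expert, it embeds a handful of positive prompts, takes the last-layer hidden state of each prompt's final token, then averages and normalizes them into a single gate vector.

import torch
from transformers import AutoModel, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"              # any causal LM works for this sketch
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, torch_dtype=torch.float16)

positive_prompts = ["code", "python", "programming"]  # prompts tied to one expert

reps = []
with torch.no_grad():
    for prompt in positive_prompts:
        inputs = tokenizer(prompt, return_tensors="pt")
        hidden = model(**inputs).last_hidden_state    # (1, seq_len, hidden_size)
        reps.append(hidden[0, -1])                    # final-token representation
gate_vector = torch.stack(reps).mean(dim=0)           # average over the prompts
gate_vector = gate_vector / gate_vector.norm()        # normalize before initializing the gate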

Crafting a FrankenMoE with MergeKit

To construct a ‘frankenMoE,’ the selection of experts is paramount. In this instance, we build on Mistral-7B-based models for their popularity and compact size, and opt for four experts, with two engaged per token and per layer. Because the non-FFN weights are shared, this configuration yields a model with roughly 24.2B parameters rather than 4 × 7 = 28B, as the rough estimate below shows.
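
The 24.2B figure can be sanity-checked with a back-of-the-envelope estimate. The sketch below assumes Mistral-7B's published dimensions (hidden size 4096, FFN intermediate size 14336, 32 layers, 32K vocabulary, grouped-query attention with a 1024-dimensional KV projection) and ignores the tiny layer norms and gates; in a frankenMoE, the attention and embedding weights are shared while each expert contributes its own FFN stack.

hidden, inter, layers, vocab = 4096, 14336, 32, 32000
kv_dim = 1024                                                  # 8 KV heads * 128 head dim

ffn_per_layer = 3 * hidden * inter                             # gate, up and down projections
attn_per_layer = 2 * hidden * hidden + 2 * hidden * kv_dim     # q, o, k, v projections
embeddings = 2 * vocab * hidden                                # input embeddings + LM head

shared = layers * attn_per_layer + embeddings                  # copied once from the base model
experts = 4 * layers * ffn_per_layer                           # one FFN stack per expert

print(f"dense Mistral-7B    ~ {(shared + layers * ffn_per_layer) / 1e9:.1f}B")  # ~7.2B
print(f"4-expert frankenMoE ~ {(shared + experts) / 1e9:.1f}B")                 # ~24.2B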

Expert Selection Process

We want a versatile model that can handle diverse tasks, such as storytelling, explaining articles, and writing Python code, so each expert is chosen for a specific domain:

  • Chat Model: mlabonne/AlphaMonarch-7B
  • Code Model: beowolx/CodeNinja-1.0-OpenChat-7B
  • Math Model: mlabonne/NeuralDaredevil-7B
  • Role-play Model: SanjiWatsuki/Kunoichi-DPO-v2-7B

YAML Configuration Blueprint

Below is the YAML configuration utilized by MergeKit to instantiate our ‘frankenMoE’:

base_model: mlabonne/AlphaMonarch-7B
experts:
  - source_model: mlabonne/AlphaMonarch-7B
    positive_prompts:
      - 'chat'
      - 'assistant'
      - 'tell me'
      - 'explain'
      - 'I want'
  - source_model: beowolx/CodeNinja-1.0-OpenChat-7B
    positive_prompts:
      - 'code'
      - 'python'
      - 'javascript'
      - 'programming'
      - 'algorithm'
  - source_model: SanjiWatsuki/Kunoichi-DPO-v2-7B
    positive_prompts:
      - 'storywriting'
      - 'write'
      - 'scene'
      - 'story'
      - 'character'
  - source_model: mlabonne/NeuralDaredevil-7B
    positive_prompts:
      - 'reason'
      - 'math'
      - 'mathematics'
      - 'solve'
      - 'count'
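
Once MergeKit's mergekit-moe command has been run on this configuration, the resulting frankenMoE (a Mixtral-style model in this case) can be loaded like any other Hugging Face checkpoint. The snippet below is a usage sketch; the local path merged-frankenmoe is a placeholder for wherever the merged model was saved.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "merged-frankenmoe"                     # placeholder output directory
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")

prompt = "Explain what a Mixture of Experts is."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))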

Conclusion

Our exploration of the Mixture of Experts architecture, coupled with the approach of creating ‘frankenMoEs’ using MergeKit, sheds light on the evolving landscape of AI model development. While frankenMoEs present inherent challenges, such as heightened VRAM demands and slower inference speeds, their potential for retaining the knowledge of their source models and for improved robustness makes them a promising direction. With the right hardware and careful configuration, these drawbacks can be effectively mitigated, paving the way for enhanced model performance and efficiency.