MuxServe: Revolutionizing the Efficient Serving of Multiple Large Language Models

MuxServe, a novel spatial-temporal multiplexing system, efficiently serves multiple Large Language Models, addressing GPU utilization challenges and achieving higher throughput and better SLO attainment.

The advent of Large Language Models (LLMs) has transformed various applications, including chat, programming, and search. However, the efficient serving of multiple LLMs has emerged as a critical challenge for endpoint providers. The primary issue lies in the substantial computational requirements of these models, with a single 175B LLM demanding eight A100 (80GB) GPUs for inference.

GPU utilization challenges in serving multiple LLMs

Current methodologies, particularly spatial partitioning, fall short in resource utilization. Spatial partitioning allocates a separate group of GPUs to each LLM, which leads to underutilization because model popularity and request rates vary widely. Less popular LLMs leave their GPUs idle while popular ones hit performance bottlenecks, underscoring the need for more efficient serving strategies.
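To see why, consider a toy utilization calculation. The Python snippet below uses made-up GPU counts and request loads, not numbers from the paper, to show how skewed popularity strands capacity under static partitioning:

```python
# Illustrative only: toy numbers showing why static spatial partitioning
# underutilizes GPUs when model popularity is skewed. All values here are
# hypothetical, not measurements from the MuxServe paper.

# (model, dedicated GPUs, offered load as a fraction of one GPU's capacity)
models = [
    ("llm-popular",  4, 3.6),   # hot model: almost saturates its 4 GPUs
    ("llm-moderate", 4, 1.2),   # lukewarm model: about 30% busy
    ("llm-rare",     4, 0.2),   # cold model: GPUs sit nearly idle
]

total_gpus = sum(gpus for _, gpus, _ in models)
total_load = sum(min(load, gpus) for _, gpus, load in models)

for name, gpus, load in models:
    util = min(load / gpus, 1.0)
    print(f"{name}: {gpus} GPUs at {util:.0%} utilization")

# Static partitioning strands capacity on the cold models:
print(f"cluster utilization (partitioned): {total_load / total_gpus:.0%}")
```

In this contrived setup the partitioned cluster runs at roughly 42% utilization even though one model is nearly saturated, which is exactly the imbalance that multiplexing targets.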

Existing attempts to solve LLM serving challenges have explored various approaches. Deep learning serving systems have focused on temporal multiplexing and scheduling strategies, but these are primarily designed for smaller models. LLM-specific systems have advanced through customized GPU kernels, parallelism techniques, and optimizations like memory management and offloading. However, these methods typically target single LLM inference. GPU sharing techniques, including temporal and spatial sharing, have been developed to improve resource utilization, but they are generally tailored for smaller DNN jobs.

“The efficient serving of multiple LLMs has emerged as a critical challenge for endpoint providers.” - Researchers from The Chinese University of Hong Kong, Shanghai AI Laboratory, Huazhong University of Science and Technology, Shanghai Jiao Tong University, Peking University, UC Berkeley, and UC San Diego

MuxServe: A Flexible Spatial-Temporal Multiplexing System

Researchers from The Chinese University of Hong Kong, Shanghai AI Laboratory, Huazhong University of Science and Technology, Shanghai Jiao Tong University, Peking University, UC Berkeley, and UC San Diego present MuxServe, a flexible spatial-temporal multiplexing approach for serving multiple LLMs that addresses these GPU utilization challenges. MuxServe separates the prefill and incremental decoding phases of inference, colocates LLMs according to their popularity, and employs an optimization framework to determine the ideal resource allocation. The system combines a greedy placement algorithm, adaptive batch scheduling, and a unified resource manager to maximize efficiency.
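The placement step can be pictured with a short sketch. The following Python is a minimal illustration of popularity-aware greedy placement under our own simplifying assumptions; it is not the authors' implementation, and the names (`GPUGroup`, `greedy_placement`) and the single scalar "load" per LLM are hypothetical:

```python
# A minimal sketch of popularity-aware greedy placement, under our own
# simplifying assumptions; this is NOT the authors' implementation. We
# assume each LLM's demand can be summarized as one "load" score
# (e.g., request rate x per-request cost) and place the heaviest LLMs
# first onto the least-loaded GPU group.

from dataclasses import dataclass, field

@dataclass
class GPUGroup:
    gpu_ids: list[int]
    load: float = 0.0
    llms: list[str] = field(default_factory=list)

def greedy_placement(llm_loads: dict[str, float],
                     groups: list[GPUGroup]) -> list[GPUGroup]:
    """Assign LLMs to GPU groups, heaviest first, balancing load."""
    for name, load in sorted(llm_loads.items(),
                             key=lambda kv: kv[1], reverse=True):
        # Colocate: pick the group with the lowest load per GPU so that
        # popular and unpopular LLMs end up sharing the same devices.
        target = min(groups, key=lambda g: g.load / len(g.gpu_ids))
        target.llms.append(name)
        target.load += load
    return groups

# Hypothetical workload: popularity (request rate) varies widely.
loads = {"llm-a": 8.0, "llm-b": 3.0, "llm-c": 1.0, "llm-d": 0.5}
groups = greedy_placement(loads, [GPUGroup([0, 1]), GPUGroup([2, 3])])
for g in groups:
    print(g.gpu_ids, g.llms, g.load)
```

The real system additionally decides GPU mesh sizes and splits prefill from decoding jobs, but the heaviest-first, least-loaded-first pattern captures the intuition of letting hot and cold models share devices.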

MuxServe architecture for efficient multi-LLM serving

MuxServe demonstrates superior performance on both synthetic and real-world workloads. In synthetic scenarios, it achieves up to 1.8× higher throughput and processes 2.9× more requests within 99% SLO attainment compared to baseline systems. Its gains vary with workload distribution and are most evident when LLM popularity is uneven. On real workloads derived from ChatLMSYS traces, MuxServe outperforms spatial partitioning and temporal multiplexing in throughput by 1.38× and 1.46×, respectively.

“MuxServe achieves up to 1.8× higher throughput and processes 2.9× more requests within 99% SLO attainment compared to baseline systems.” - Researchers from The Chinese University of Hong Kong, Shanghai AI Laboratory, Huazhong University of Science and Technology, Shanghai Jiao Tong University, Peking University, UC Berkeley, and UC San Diego

Conclusion

MuxServe represents a significant advancement in the field of LLM serving. By introducing flexible spatial-temporal multiplexing, the system effectively addresses the challenges of serving multiple LLMs concurrently. Its innovative approach of colocating LLMs based on their popularity and separating prefill and decoding jobs leads to improved GPU utilization. This method demonstrates substantial performance gains over existing systems, achieving higher throughput and better SLO attainment across various workload scenarios. MuxServe’s ability to adapt to different LLM sizes and request patterns makes it a versatile solution for the growing demands of LLM deployment.

MuxServe performance in real-world workloads

As the AI industry continues to evolve, MuxServe provides a promising framework for efficient and scalable LLM serving. With its flexible spatial-temporal multiplexing approach, MuxServe is poised to revolutionize the way we serve multiple LLMs, enabling faster and more efficient processing of complex AI workloads.