Revolutionizing Efficient LLM Serving for LMaaS with Semantic-Based Request Length Prediction
Transformer-based generative Large Language Models (LLMs) have shown considerable strength in a broad range of Natural Language Processing (NLP) tasks. Numerous applications benefit from their wide applicability; however, for most developers, the cost of training and deploying these models is prohibitive. To address this, leading AI firms such as OpenAI, Google, and Baidu offer language-model-as-a-service (LMaaS), granting access to their LLMs through APIs.
LMaaS Architecture
In an LMaaS scenario, application developers provide the LLM service with user input messages and task-specific instructions. To improve quality of service (QoS) and support more customers, service providers strive to reduce response times and increase throughput. However, current systems such as TensorFlow Serving and Triton Inference Server handle queries inefficiently: they process requests in a first-come, first-served (FCFS) fashion with a predetermined batch size. To prevent out-of-memory (OOM) errors, these systems keep batch sizes small, which leaves much of the GPUs' parallel compute capacity unused.
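For illustration, here is a minimal sketch of that fixed-size FCFS batching loop. The queue, the `run_inference` callback, and the batch-size constant are hypothetical stand-ins for this article, not the actual TensorFlow Serving or Triton implementation.

```python
from collections import deque

# Hypothetical fixed-size FCFS batching loop: requests are served strictly
# in arrival order, and the batch size is a conservative constant chosen
# to avoid out-of-memory errors rather than to maximize GPU utilization.
BATCH_SIZE = 8  # assumed constant, kept small to prevent OOM

request_queue = deque()  # requests appended in arrival order

def serve_fcfs(run_inference):
    """Pop the oldest BATCH_SIZE requests and run them together.

    Every request in the batch occupies GPU resources until the longest
    generation in the batch finishes, which is the inefficiency that
    later approaches try to remove.
    """
    while request_queue:
        batch = [request_queue.popleft()
                 for _ in range(min(BATCH_SIZE, len(request_queue)))]
        run_inference(batch)  # placeholder for the actual model call
```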
“In many applications, there is a positive correlation between the length of the text that is created and the text that is entered by the user.”
Continuous batching has been proposed to address this: it dynamically evicts finished requests from a batch and admits new ones during processing. However, this approach typically relies on conservative GPU memory management, which limits throughput by not fully exploiting the GPUs' parallel processing capacity. Other strategies, such as model quantization and pruning, promise to reduce memory usage but may degrade the quality of the generated output.
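The sketch below illustrates the continuous-batching idea in Python-flavored pseudocode: after each decoding step, finished requests are evicted and waiting requests are admitted, subject to a conservative memory budget. The names (`decode_step`, `memory_cost`, the `finished` flag) are illustrative assumptions, not any particular serving framework's API.

```python
def continuous_batching(active, waiting, memory_budget, decode_step, memory_cost):
    """One scheduling view of continuous batching.

    `active`  : requests currently being decoded
    `waiting` : requests queued for admission
    After every decoding step, finished requests are removed and new
    requests are admitted while the (conservatively estimated) memory
    cost of the batch stays within `memory_budget`.
    """
    while active or waiting:
        # Admit waiting requests while the conservative estimate allows it.
        while waiting and memory_cost(active + [waiting[0]]) <= memory_budget:
            active.append(waiting.pop(0))

        decode_step(active)  # generate one token for every active request

        # Evict requests that emitted an end-of-sequence token.
        active = [r for r in active if not r.finished]
```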
A team of AI researchers from China has proposed Magnus, a system that uses application-level and user-level semantic information, together with the length of the user's input, to predict request generation lengths accurately. Magnus consists of four components: a generation length predictor, an adaptive batcher, a serving time estimator, and a batch scheduler. The generation length predictor estimates each request's generation length from the user input length, application-level semantic features, and user-level semantic features using a random forest regressor. To minimize wasted computation, the adaptive batcher groups requests with similar predicted lengths and chooses an appropriate batch size.
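As a rough illustration of these two components, the snippet below trains a random forest regressor on a feature vector combining the input length with application- and user-level semantic embeddings, then bins requests by predicted length for batching. The feature construction, hyperparameters, and bin width are assumptions for this sketch; the paper's exact design may differ.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def build_features(input_len, app_embedding, user_embedding):
    """Concatenate the input length with application-level and user-level
    semantic features (e.g. instruction/profile embeddings) -- an assumed
    feature layout, not the paper's exact one."""
    return np.concatenate(([input_len], app_embedding, user_embedding))

def train_length_predictor(X, y):
    """X: one feature vector per historical request, y: observed lengths."""
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X, y)
    return model

def group_by_predicted_length(requests, model, bin_width=64):
    """Adaptive-batching idea: requests whose predicted generation lengths
    fall into the same bin are batched together, so little GPU compute is
    wasted on sequences that finish much earlier than their batch mates."""
    bins = {}
    for req in requests:
        pred = model.predict(req.features.reshape(1, -1))[0]
        bins.setdefault(int(pred // bin_width), []).append(req)
    return list(bins.values())
```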
Magnus Architecture
To further improve QoS, the batch scheduler selects batches using the highest response ratio next (HRRN) policy, which shortens request queuing times and therefore response times, while the serving time estimator uses KNN regression to predict how long each batch will take to serve.
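A minimal sketch of these two pieces follows, assuming the standard HRRN definition (response ratio = (waiting time + estimated serving time) / estimated serving time) and a scikit-learn KNN regressor as the serving time estimator; the batch feature representation is a placeholder.

```python
import time
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def train_serving_time_estimator(batch_features, observed_serving_times, k=5):
    """KNN regression: a new batch's serving time is estimated from the
    serving times of the k most similar previously executed batches."""
    est = KNeighborsRegressor(n_neighbors=k)
    est.fit(batch_features, observed_serving_times)
    return est

def pick_next_batch(pending_batches, estimator, now=None):
    """Highest response ratio next (HRRN): favors batches that have waited
    long relative to their estimated serving time, which keeps queuing
    delay low without starving short batches."""
    now = time.time() if now is None else now

    def response_ratio(batch):
        wait = now - batch.enqueue_time  # assumed per-batch timestamp
        service = float(estimator.predict(batch.features.reshape(1, -1))[0])
        return (wait + service) / max(service, 1e-6)

    return max(pending_batches, key=response_ratio)
```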
When a prototype of Magnus was tested with ChatGLM-6B instances on NVIDIA V100 GPUs, it showed notable gains over the baselines in serving latency, request throughput, and serving efficiency. Experimental results on the testbed showed that, compared with baseline approaches, Magnus increases request throughput by up to 234% and reduces response times by up to 89.7%. This improvement demonstrates how effectively batch serving in LMaaS can be optimized by exploiting generation length estimates.
Performance Comparison
The proposed Magnus system has the potential to revolutionize the LMaaS landscape by providing efficient and scalable language model serving. With its ability to forecast request generation lengths accurately, Magnus can significantly improve the quality of service and support more customers, making it an attractive solution for developers and service providers alike.