Reducing the Cost of Large Language Models: A Frugal Approach

This article explores the concept of FrugalGPT, a cost-saving architecture for LLM-driven apps that reduces cost and improves performance. It discusses the cost comparison of different LLMs, the relationship between cost and performance, and the cascading LLM system.

As the use of large language models (LLMs) continues to grow, so does the cost of running them. The cost of an LLM can be measured in various ways, but for third-party LLM-as-a-service providers, pricing is typically based on the number of tokens processed. Vendors count tokens somewhat differently, but for simplicity we will treat cost as proportional to token count.
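
To make the pricing concrete, here is a minimal sketch of the token arithmetic. The model names and per-1,000-token prices are placeholder values for illustration, not actual vendor rates.

```python
# Rough token-based cost estimate. Model names and per-1,000-token prices
# below are placeholders for illustration, not actual vendor rates.
PRICE_PER_1K_TOKENS = {
    "cheap-model": 0.0005,
    "mid-model": 0.002,
    "expensive-model": 0.03,
}

def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate the dollar cost of one request to `model`."""
    total_tokens = prompt_tokens + completion_tokens
    return total_tokens / 1000 * PRICE_PER_1K_TOKENS[model]

# e.g. a 500-token prompt with a 200-token answer
print(f"${estimate_cost('expensive-model', 500, 200):.4f}")  # $0.0210
```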

Cost comparison of different LLMs

The authors of the paper “FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance” provide a cost comparison of different LLMs, highlighting the significant difference in cost between various LLMs.

Performance vs Cost

The general relationship between cost and performance can be seen in the following graph, with FrugalGPT’s performance overlaid in red.

[Figure: Performance vs Cost]

Cascading LLMs

FrugalGPT’s system relies on a cascade of LLMs to answer the user. The query is sent first to the cheapest LLM, and if its answer is judged good enough, it is returned. If not, the query is passed along to the next cheapest LLM. The researchers’ reasoning is that if a less expensive model answers a question incorrectly, a more expensive model is likely to answer it correctly. The chain is therefore ordered from least expensive to most expensive, on the assumption that answer quality rises with price (a minimal sketch of this loop follows the figure below).

[Figure: Cascading LLMs]
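
To make the flow concrete, here is a minimal sketch of such a cascade in Python. This illustrates the idea rather than the paper's implementation: `call_model` and `score_answer` are hypothetical placeholders for a real LLM API call and a learned answer scorer, and the 0.8 threshold is arbitrary.

```python
# Minimal sketch of an LLM cascade. `call_model` and `score_answer` are
# placeholders for a real LLM API call and a learned answer scorer.
from typing import Callable, List, Tuple

# Models ordered from least to most expensive.
CASCADE: List[str] = ["cheap-model", "mid-model", "expensive-model"]

def run_cascade(
    query: str,
    call_model: Callable[[str, str], str],      # (model name, query) -> answer
    score_answer: Callable[[str, str], float],  # (query, answer) -> quality score
    threshold: float = 0.8,                     # illustrative acceptance threshold
) -> Tuple[str, str]:
    """Try models cheapest-first and return the first answer the scorer accepts."""
    answer = ""
    for model in CASCADE:
        answer = call_model(model, query)
        if score_answer(query, answer) >= threshold:
            return model, answer  # good enough: stop before calling pricier models
    # No answer passed the threshold; fall back to the most expensive model's answer.
    return CASCADE[-1], answer
```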

Better Average Quality Than Just Querying the Best LLM

One might ask, if quality is most important, why not just query the best LLM and work on ways to reduce its cost? When this paper was published, GPT-4 was the best LLM, yet it did not always give a better answer than the FrugalGPT system! The authors speculate that, just as the most capable person doesn’t always give the right answer, the most complex model won’t either. Thus, by having the answer go through a filtering process with DistilBERT, you are removing any answers that aren’t up to par and increasing the odds of a good answer.

[Figure: Better Average Quality]
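
For a sense of what that filter could look like, here is a rough sketch of a DistilBERT-based scorer using Hugging Face's transformers, which could plug in as the `score_answer` placeholder above. It is not the paper's exact setup: the checkpoint here is the generic base model and would need fine-tuning on labeled query–answer pairs before its scores mean anything.

```python
# Rough sketch of a DistilBERT-based answer scorer (usable as the
# `score_answer` placeholder above). The base checkpoint is untrained for
# this task; a real scorer is fine-tuned on labeled (query, answer) pairs.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

CHECKPOINT = "distilbert-base-uncased"  # stand-in, not the paper's trained scorer
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=2)
model.eval()

def score_answer(query: str, answer: str) -> float:
    """Return a score in [0, 1] estimating how reliable `answer` is for `query`."""
    inputs = tokenizer(query, answer, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    # Treat class 1 as "reliable answer" and return its probability.
    return torch.softmax(logits, dim=-1)[0, 1].item()
```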

Moving Forwards with Cost Savings

The results of this paper are intriguing and raise the question of how much further cost savings can go without investing in additional model optimization. One possibility is to cache model answers in a vector database and run a similarity search to check whether a cached answer will do before starting the LLM cascade. This would significantly reduce costs by replacing a costly LLM call with a comparatively cheap embedding lookup and similarity search; a rough sketch follows.
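
Here is a minimal in-memory sketch of that idea, assuming a generic `embed` function standing in for a sentence-embedding model and an arbitrary similarity threshold; a production system would use an actual vector database rather than a Python list.

```python
# In-memory semantic-cache sketch. `embed` stands in for any sentence-embedding
# model; a production system would use a real vector database instead of a list.
from typing import Callable, List, Optional, Tuple
import numpy as np

class SemanticCache:
    def __init__(self, embed: Callable[[str], np.ndarray], threshold: float = 0.9):
        self.embed = embed
        self.threshold = threshold  # illustrative cosine-similarity cutoff
        self.entries: List[Tuple[np.ndarray, str]] = []  # (query embedding, answer)

    def lookup(self, query: str) -> Optional[str]:
        """Return a cached answer if a sufficiently similar query was seen before."""
        q = self.embed(query)
        for vec, answer in self.entries:
            sim = float(np.dot(q, vec) / (np.linalg.norm(q) * np.linalg.norm(vec)))
            if sim >= self.threshold:
                return answer
        return None

    def store(self, query: str, answer: str) -> None:
        self.entries.append((self.embed(query), answer))

def answer_query(query: str, cache: SemanticCache, run_cascade: Callable[[str], str]) -> str:
    """Check the cache first; fall back to the LLM cascade only on a miss."""
    cached = cache.lookup(query)
    if cached is not None:
        return cached  # no LLM calls at all
    answer = run_cascade(query)  # e.g. the cascade sketched earlier, with its arguments bound
    cache.store(query, answer)
    return answer
```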

Additionally, it makes you wonder whether older models are still worth cost-optimizing: if you can reduce their cost per token, they can still add value within an LLM cascade. A related question is at what point adding new LLMs to the chain yields diminishing returns.

Questions for Further Study

As the world creates more LLMs and we increasingly build systems that use them, we will want cost-effective ways to run them. This paper lays out a strong framework for future builders to expand on, and it makes me wonder how far that framework can go.

In my opinion, this framework applies well to general queries whose answers do not depend on the user, such as a tutor LLM. However, for use cases where answers do differ by user, say an LLM acting as a customer service agent, the scoring system would have to be aware of who the LLM was talking with.
