FrugalGPT and Reducing LLM Operating Costs | by Matthew Gunton | Mar, 2024


There are a number of ways to determine the cost of running an LLM (electricity use, compute cost, and so on); however, if you use a third-party LLM (an LLM-as-a-service), you are typically charged based on the tokens you use. Different vendors (OpenAI, Anthropic, Cohere, and so on.) have different ways of counting tokens, but for the sake of simplicity, we'll consider the cost to be based on the number of tokens processed by the LLM.
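To make the billing model concrete, here is a minimal sketch of how per-token pricing works. The prices in the example are hypothetical and not taken from the paper or any vendor's price sheet.

```python
# Minimal sketch of token-based billing; the per-1K-token prices below are
# hypothetical placeholders, not actual vendor rates.
def estimate_cost(input_tokens: int, output_tokens: int,
                  price_per_1k_input: float, price_per_1k_output: float) -> float:
    """Return the dollar cost of a single LLM call under per-token pricing."""
    return (input_tokens / 1000) * price_per_1k_input \
         + (output_tokens / 1000) * price_per_1k_output

# Example: a 500-token prompt with a 200-token completion,
# priced at $0.01 / 1K input tokens and $0.03 / 1K output tokens.
print(estimate_cost(500, 200, 0.01, 0.03))  # 0.011
```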

The most important part of this framework is the idea that different models cost different amounts. The authors of the paper conveniently assembled the table below highlighting the differences in cost, and those differences are significant. For example, AI21's output tokens cost an order of magnitude more than GPT-4's do in this table!

Table 1 from the paper

As part of cost optimization, we always want to find a way to maximize answer quality while minimizing cost. Typically, higher-cost models are higher-performing models, able to give better answers than lower-cost ones. The general relationship can be seen in the graph below, with FrugalGPT's performance overlaid on top in red.

Figure 1c from the paper comparing various LLMs based on how often they accurately answered questions from the HEADLINES dataset

Exploiting the vast cost difference between models, the researchers' FrugalGPT system relies on a cascade of LLMs to give the user an answer. Put simply, the user query starts with the cheapest LLM, and if the answer is good enough, it is returned. However, if the answer is not good enough, the query is passed along to the next-cheapest LLM.

The researchers' logic is as follows: if a cheaper model answers a question incorrectly, it is likely that a more expensive model will answer it correctly. Thus, to minimize costs, the chain is ordered from least expensive to most expensive, on the assumption that quality goes up as cost goes up.

Figure 2e from the paper illustrating the LLM cascade
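To make the control flow concrete, here is a minimal sketch of the cascade loop, under stated assumptions: the caller supplies a `query_model` function and a `score_answer` function (the paper's actual scorer is the DistilBERT model described next), and the threshold value is illustrative, not from the paper.

```python
from typing import Callable, Sequence

def cascade(question: str,
            models: Sequence[str],
            query_model: Callable[[str, str], str],   # (model_name, question) -> answer
            score_answer: Callable[[str, str], float], # (question, answer) -> quality score
            threshold: float = 0.8) -> str:
    """Try models from cheapest to most expensive, returning the first
    answer whose quality score clears the threshold."""
    answer = ""
    for model_name in models:  # assumed ordered cheapest -> most expensive
        answer = query_model(model_name, question)
        if score_answer(question, answer) >= threshold:
            return answer      # good enough: stop here and save the cost of larger models
    return answer              # otherwise, fall back to the most expensive model's answer
```

The ordering of `models` is what drives the savings: most queries never reach the expensive end of the chain.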

This setup depends on reliably determining when an answer is good enough and when it isn't. To solve this, the authors trained a DistilBERT model that takes the question and answer and assigns a score to the answer. Because the DistilBERT model is far smaller than the other models in the sequence, the cost of running it is almost negligible compared to the others.
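Below is a sketch of how such a scorer could be wrapped, assuming a DistilBERT checkpoint that has already been fine-tuned to predict answer quality. The checkpoint path is a placeholder, not the authors' released model, and the sigmoid-over-one-logit formulation is one reasonable choice, not necessarily theirs.

```python
# Sketch of a DistilBERT-based answer scorer; the checkpoint path is a
# placeholder for a model fine-tuned on (question, answer, correct?) data.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "path/to/finetuned-quality-scorer", num_labels=1)

def score_answer(question: str, answer: str) -> float:
    """Score a (question, answer) pair; higher means more likely correct."""
    inputs = tokenizer(question, answer, return_tensors="pt",
                       truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**inputs).logits     # shape (1, 1)
    return torch.sigmoid(logits).item()     # squash to a 0-1 quality score
```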

One might naturally ask: if quality is most important, why not just query the best LLM and work on ways to reduce the cost of running that best LLM?

When this paper came out, GPT-4 was the best LLM the authors evaluated, yet GPT-4 did not always give a better answer than the FrugalGPT system! (Eagle-eyed readers will have noticed this in the cost-versus-performance graph above.) The authors speculate that, just as the most capable person doesn't always give the right answer, the most complex model won't either. Thus, by having the answer go through a filtering process with DistilBERT, you are removing any answers that aren't up to par and increasing the odds of a good answer.

Figure 5a from the paper showing instances where FrugalGPT outperforms GPT-4

Consequently, this methodology not only reduces your costs but can also increase quality beyond what you would get by simply using the best LLM!

The results of this paper are fascinating to consider. For me, they raise questions about how we can go even further with cost savings without having to invest in additional model optimization.

One such possibility is to cache all model answers in a vector database and then do a similarity search to determine whether an answer in the cache works before starting the LLM cascade. This would significantly reduce costs by replacing a costly LLM operation with a comparatively cheap query and similarity check.
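Here is a rough sketch of that idea, using an in-memory cache and sentence-transformers embeddings; the embedding model and similarity threshold are assumptions for illustration, not something from the paper.

```python
# Sketch of a semantic cache placed in front of the LLM cascade.
# The embedding model and the 0.9 threshold are illustrative assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
cache: list[tuple[np.ndarray, str]] = []  # (question embedding, cached answer)

def answer_with_cache(question: str, run_cascade, similarity_threshold: float = 0.9) -> str:
    """Return a cached answer for a near-duplicate question, else run the cascade."""
    query_vec = embedder.encode(question, normalize_embeddings=True)
    for cached_vec, cached_answer in cache:
        if float(np.dot(query_vec, cached_vec)) >= similarity_threshold:
            return cached_answer              # cache hit: skip the LLMs entirely
    answer = run_cascade(question)            # cache miss: fall back to the cascade
    cache.append((query_vec, answer))
    return answer
```

In practice you would swap the in-memory list for a proper vector database, but the cost argument is the same: an embedding lookup is far cheaper than even the cheapest model in the cascade.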

Additionally, it makes you wonder whether older models can still be worth cost-optimizing, since if you can reduce their cost per token, they can still create value within the LLM cascade. Relatedly, the key question is at what point you hit diminishing returns by adding new LLMs to the chain.
