Evaluation of text generation speeds for a set of compact large language models

Abstract:

We present an empirical evaluation of several compact large language models (LLMs) from the Llama 2 family. These models are small enough to run on a typical consumer machine, which allows them to be used offline with increased privacy. Their competence at tasks such as providing instructions or explanations on a wide variety of subjects makes them a viable alternative to online language processing tools such as ChatGPT. We evaluate the impact of GPU offloading, the number of threads, and the context size on token generation speed.
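The three parameters evaluated here (GPU offloading, thread count, context size) map directly onto runtime knobs of common local-inference stacks. As a minimal sketch, assuming the llama-cpp-python bindings (the paper does not name its inference backend; the model path, prompt, and parameter values below are placeholders), a throughput measurement could look like:

```python
# Sketch of a token-generation-speed benchmark, assuming llama-cpp-python.
# All concrete values (model path, prompt, parameter settings) are
# illustrative assumptions, not the paper's actual configuration.
import time


def tokens_per_second(n_tokens: int, elapsed_s: float) -> float:
    """Throughput metric: generated tokens divided by wall-clock seconds."""
    return n_tokens / elapsed_s


def benchmark(model_path: str, n_gpu_layers: int, n_threads: int,
              n_ctx: int, prompt: str = "Explain photosynthesis briefly.",
              max_tokens: int = 128) -> float:
    # n_gpu_layers controls GPU offloading, n_threads the CPU thread
    # count, and n_ctx the context window -- the three factors studied.
    from llama_cpp import Llama
    llm = Llama(model_path=model_path, n_gpu_layers=n_gpu_layers,
                n_threads=n_threads, n_ctx=n_ctx, verbose=False)
    start = time.perf_counter()
    out = llm(prompt, max_tokens=max_tokens)
    elapsed = time.perf_counter() - start
    return tokens_per_second(out["usage"]["completion_tokens"], elapsed)
```

Sweeping one knob at a time (e.g. `n_gpu_layers` in {0, 8, 16, 32} with the others fixed) would reproduce the kind of comparison the abstract describes.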