LLM Inference Settings

With different language models, we’ve found that adjusting settings like Seed, Maximum Tokens, Temperature, Top P, and Top K is incredibly helpful for achieving high-quality content with low latency. Adjustments like these allow the model to respond to specific requirements. Since these tweaks are so useful, here is a brief overview of each setting:

Seed

Seed refers to a value that initializes the random number generator used by the model to generate text. It determines the sequence of random numbers used to sample from the model’s output probabilities.

Setting a specific seed value ensures that the same sequence of random numbers is used every time the model generates text. This results in the same or similar output.

On the other hand, if a random seed value is used (i.e., the seed is not explicitly set), the model will use a different sequence of random numbers each time it generates text, resulting in different output.
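To make the effect of a seed concrete, here is a minimal sketch that uses Python's standard `random` module rather than a real model. The vocabulary, the probabilities, and the `sample_tokens` helper are all made up for illustration.

```python
import random

# Hypothetical toy vocabulary and next-token probabilities (not from a real model).
VOCAB = ["cat", "dog", "bird", "fish"]
PROBS = [0.5, 0.3, 0.15, 0.05]

def sample_tokens(n, seed=None):
    """Draw n tokens; a fixed seed fixes the sequence of random draws."""
    rng = random.Random(seed)  # seed=None lets Python pick a fresh seed each time
    return rng.choices(VOCAB, weights=PROBS, k=n)

run_a = sample_tokens(8, seed=10)
run_b = sample_tokens(8, seed=10)
print(run_a == run_b)  # True: same seed, same draws
print(sample_tokens(8))  # no seed: usually differs from run to run
```

Real inference stacks can add other sources of variation, such as hardware or batching effects, which is one reason a fixed seed is often described as giving "the same or similar" output.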

Maximum Tokens

The Maximum Tokens parameter covers both the input tokens (the prompt or context given to the model) and the output tokens (the response generated by the model). For example, if Maximum Tokens is set to 2048, the combined number of input and output tokens cannot exceed 2048. This means that a longer prompt leaves less room for the generated response. Some APIs instead apply a similarly named limit to the output tokens only, so it’s worth checking how your provider counts.
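The budget arithmetic is simple enough to sketch in a few lines. The `output_budget` helper below is hypothetical and assumes the limit covers input and output together, as described above.

```python
def output_budget(max_tokens: int, prompt_tokens: int) -> int:
    """Tokens left for the response when the limit covers input plus output."""
    return max(0, max_tokens - prompt_tokens)

print(output_budget(2048, 500))   # 1548
print(output_budget(2048, 1900))  # 148
print(output_budget(2048, 3000))  # 0: the prompt alone exceeds the limit
```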

Please note that setting the maximum token limit too low may lead to responses being cut off or left incomplete. In contrast, setting the limit too high can increase computational demands and processing time. For this reason, it’s a good idea to tailor the maximum token limit to your needs.

Temperature

The temperature setting controls the randomness of the model’s responses during text generation. It influences how the model selects the next token in a sequence, affecting the creativity and predictability of the output.

A low temperature value (closer to 0) makes the model’s responses more deterministic and focused. The model almost always chooses the most probable next token, leading to more predictable and less varied text.

A high temperature value (closer to 1) increases randomness in model responses. This allows the model to select less probable tokens, resulting in more creative, diverse, and sometimes less coherent text. However, a very high temperature also increases the risk of nonsensical or off-topic content, which is often described as “hallucination”.
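A common way to implement temperature is to divide the model's raw scores (logits) by the temperature before converting them to probabilities with a softmax. The sketch below assumes that approach. The logits are invented, and `softmax_with_temperature` is an illustrative helper, not any specific library's API.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
low = softmax_with_temperature(logits, 0.2)
high = softmax_with_temperature(logits, 1.0)
print([round(p, 3) for p in low])   # mass concentrates on the top token
print([round(p, 3) for p in high])  # mass is spread more evenly
```

At a temperature of 0.2 nearly all of the probability lands on the highest-scoring token, while at 1.0 the other candidates keep a meaningful share.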

Top P

Top P, also known as Nucleus Sampling, is a method for controlling the randomness of text generated by language models. It is a hyperparameter that influences which tokens (words or parts of words) the model considers when generating the next part of text.

When a language model generates text, it assigns a probability to each possible next token based on the context it has seen so far. Top P sampling ranks these tokens from most to least probable and keeps the smallest set whose cumulative probability reaches a threshold P. This threshold is set by the Top P value, and the next token is sampled only from that set.

A higher Top P value allows for more diversity in the generated text because it includes less probable tokens in the sampling process. Conversely, a lower Top P value makes the model’s output more predictable and focused, as it restricts the selection to a smaller set of more likely tokens.

Unlike Top K sampling, which selects a fixed number of the most probable tokens, Top P’s dynamic shortlisting adapts to the probability distribution of the tokens. This means the number of tokens considered can vary depending on their probabilities and the chosen P value.
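Here is a minimal sketch of that shortlisting step in plain Python. The token probabilities are made up, and `top_p_filter` is a hypothetical helper. Real implementations work on the model's full vocabulary and may differ in details such as tie handling.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    # Renormalize so the surviving probabilities sum to 1.
    return {token: prob / total for token, prob in kept}

probs = {"the": 0.45, "a": 0.25, "this": 0.2, "zebra": 0.1}
print(list(top_p_filter(probs, 0.8)))  # ['the', 'a', 'this']
print(list(top_p_filter(probs, 0.3)))  # ['the']
```

Note how the number of surviving tokens changes with P and with the shape of the distribution, which is the "dynamic shortlisting" described above.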

Top K

Top K is a hyperparameter that determines the number of most likely next tokens that the model will consider when generating text.

When a language model generates text, it calculates the probability of each possible next token based on the context provided. Top K, also known as Top K sampling, restricts the model’s choices to the K most probable tokens. For example, if K is set to 40, the model will only consider the top 40 most likely tokens as candidates for the next token in the sequence.

By setting the Top K value, users can influence the diversity and predictability of the model’s output. A smaller K value leads to more predictable text, while a larger K value allows for more variation and creativity. For applications where both the quality of the output and the computational efficiency are important, setting Top K to 40 can help manage the trade-off by limiting sampling to 40 candidates at each step of the generation process.
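The Top K cut-off can be sketched the same way. As before, the probabilities are invented and `top_k_filter` is a hypothetical helper for illustration only.

```python
def top_k_filter(probs, k):
    """Keep only the k most probable tokens and renormalize their probabilities."""
    if k < 1:
        raise ValueError("k must be at least 1")
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(prob for _, prob in ranked)
    return {token: prob / total for token, prob in ranked}

probs = {"the": 0.45, "a": 0.25, "this": 0.2, "zebra": 0.1}
print(list(top_k_filter(probs, 2)))   # ['the', 'a']
print(len(top_k_filter(probs, 40)))   # 4: k larger than the vocabulary keeps everything
```

Unlike Top P, the size of the shortlist here depends only on K, not on how the probability is distributed.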

Practical Example

The settings Seed=10, Maximum Tokens=2048, Temperature=0.2, Top P=0.8, and Top K=40, as shown in the image at the beginning of this blog, represent an approach to creating text with a language model that balances predictability and diversity. Here’s a quick analysis of how these settings work together:

Seed = 10

This ensures reproducibility. With the same seed value, the model will generate the same or similar text sequence for a given input. It’s handy for testing and comparing model behavior.

Maximum Tokens = 2048

This is a fairly high limit, so longer texts are allowed. It’s great for applications that need detailed responses, like writing articles, reports, or stories. However, generating such a long sequence might increase computational demands and processing time.

Temperature = 0.2

A low temperature value like this biases the model towards more predictable, less varied text. It’s great for technical documentation or specific factual answers, where accuracy and relevance are more important than creativity.

Top P = 0.8

With this setting, tokens that cumulatively make up 80% of the probability mass are taken into account, which allows for a moderate level of creativity and variability. It’s a good balance that can keep the text coherent while adding diversity.

Top K = 40

Limiting the model to the top 40 most likely next tokens at each step helps maintain relevance and coherence. This value strips out highly improbable tokens that could make the text illogical or off-topic.
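To see how these settings interact, here is a rough sketch of a single sampling step that applies temperature, then Top K, then Top P, using a seeded random number generator. The order of the filters is an assumption (implementations differ), the logits are invented, and `sample_next_token` is a hypothetical helper rather than any real library's API. Maximum Tokens is left out because it limits how many steps run rather than how each step samples.

```python
import math
import random

def sample_next_token(logits, temperature=0.2, top_k=40, top_p=0.8, rng=None):
    """Sample one token from a dict of token -> logit.

    Assumed order: temperature-scaled softmax, then Top K, then Top P.
    """
    if rng is None:
        rng = random.Random()
    # Temperature-scaled softmax (subtract the max logit for stability).
    peak = max(logits.values())
    weights = {t: math.exp((v - peak) / temperature) for t, v in logits.items()}
    total = sum(weights.values())
    ranked = sorted(((t, w / total) for t, w in weights.items()),
                    key=lambda kv: kv[1], reverse=True)
    # Top K: keep at most the k most probable tokens.
    ranked = ranked[:top_k]
    # Top P: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    tokens, probs = zip(*kept)
    return rng.choices(tokens, weights=probs, k=1)[0]

# Hypothetical logits for the next token after "The capital of France is".
logits = {"Paris": 3.0, "London": 2.0, "Rome": 1.5, "banana": -2.0}

rng = random.Random(10)  # Seed = 10
picks = [sample_next_token(logits, rng=rng) for _ in range(5)]
print(picks)  # ['Paris', 'Paris', 'Paris', 'Paris', 'Paris']
```

With a temperature of 0.2 the top token already holds more than 80% of the probability mass, so Top P=0.8 leaves it as the only candidate and the output is effectively deterministic. Raising the temperature to 1.0 in this toy example lets a second candidate through the Top P filter, which is where the seed starts to matter.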