Inference Parameters

Although they vary from provider to provider, here are the most common settings you can adjust when generating completions.

Temperature

Language models typically produce a probability distribution over possible next tokens before selecting one. A higher temperature makes it more likely that the model will select a less probable token, so outputs become more random and varied as temperature increases. This is often treated as a proxy for creativity. At a temperature of 0, the model will always select the most likely token.

There are sources of non-determinism, originating from the way GPUs perform floating-point matrix multiplication, that can produce different outputs even at a temperature of 0, but their effect is negligible compared to the impact of temperature.
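
To make the effect concrete, here is a minimal sketch of temperature-scaled sampling in Python. The logit values are made up for illustration; real models produce one logit per vocabulary token.

```python
import numpy as np

def sample_with_temperature(logits, temperature=1.0, rng=np.random.default_rng()):
    """Sample a token index from raw logits, scaled by temperature."""
    if temperature == 0:
        # Greedy decoding: always pick the most likely token.
        return int(np.argmax(logits))
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled -= scaled.max()                          # for numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()   # softmax
    return int(rng.choice(len(probs), p=probs))

logits = [2.0, 1.0, 0.1]                                    # hypothetical scores for three tokens
print(sample_with_temperature(logits, temperature=0.2))     # almost always token 0
print(sample_with_temperature(logits, temperature=1.5))     # noticeably more variety
```

Dividing the logits by a temperature above 1 flattens the distribution, while a temperature below 1 sharpens it toward the most likely token.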

Top P

Top P (also called nucleus sampling) limits selection to the smallest set of tokens whose combined probability reaches the chosen value. A value of 0.9 means the model only samples from the tokens that make up the top 90% of probability mass, while 0.1 restricts it to the top 10%. You can use Top P and temperature together for interesting results.
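
Here is a rough sketch of how Top P filtering works on an already-computed probability distribution; the example probabilities are invented for illustration.

```python
import numpy as np

def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of tokens whose cumulative probability reaches top_p."""
    order = np.argsort(probs)[::-1]                  # most likely tokens first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1  # number of tokens to keep
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()                 # renormalize over the kept tokens

probs = np.array([0.5, 0.3, 0.15, 0.05])
print(top_p_filter(probs, top_p=0.9))   # the 0.05 tail token is dropped
```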

Max Tokens

LLMs keep generating text until they produce an "End of Sequence" token, also called a "stop" token. You can also cut them off early by setting Max Tokens. It's always a good idea to set a reasonable limit on output tokens to prevent excessive token usage, especially with models capable of long outputs.
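
As an example, here is how you might set a token cap, assuming the OpenAI Python SDK; the model name is illustrative, and other providers expose an equivalent setting under a similar parameter name.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Cap the output at 200 tokens; generation also stops earlier
# if the model emits its end-of-sequence / stop token.
response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[{"role": "user", "content": "Summarize the plot of Hamlet."}],
    max_tokens=200,
)
print(response.choices[0].message.content)
```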

Number of Completions

Some model providers allow you to generate multiple completions in a single request. This is only useful with a temperature greater than 0; at a temperature of 0 every completion would be nearly identical.
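
For example, with the OpenAI Python SDK the number of completions is controlled by the n parameter (again, the model name is illustrative; other providers may name this setting differently or not support it).

```python
from openai import OpenAI

client = OpenAI()

# Request three completions in one call; a nonzero temperature
# is what makes the three outputs differ from each other.
response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[{"role": "user", "content": "Suggest a name for a coffee shop."}],
    n=3,
    temperature=0.8,
)
for choice in response.choices:
    print(choice.message.content)
```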
