Inference Parameters
Although the available settings vary from provider to provider, here are the most common ones you can adjust when generating completions.
Language models typically produce a list of possible next tokens, each with a probability, before selecting one. A higher temperature makes it more likely that the model will select a less-likely token, so output becomes more random and varied as temperature increases. This is often treated as a proxy for creativity. At a temperature of 0, the model always selects the most likely token.
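As a rough illustration, here is a minimal sketch (in Python with NumPy, not any particular provider's implementation) of how temperature rescales the model's token probabilities before sampling:

```python
import numpy as np

def sample_with_temperature(logits, temperature=1.0, rng=np.random.default_rng()):
    """Pick the next token id from raw logits, scaled by temperature.

    A temperature of 0 collapses to greedy selection of the single most
    likely token; higher values flatten the distribution so less-likely
    tokens are chosen more often.
    """
    if temperature <= 0:
        return int(np.argmax(logits))          # greedy: always the top token
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    probs = np.exp(scaled - scaled.max())      # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))
```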
Top P (also called nucleus sampling) limits selection to the smallest set of most-likely tokens whose cumulative probability reaches the chosen value. A value of 0.9 means the next token is drawn only from tokens that together account for the top 90% of probability mass, while 0.1 restricts sampling to the top 10%. You can tune Top P and temperature together to balance coherence and variety.
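A sketch of the same idea in code, again only illustrative: keep the most likely tokens until their probabilities add up to the Top P value, zero out the rest, and renormalize.

```python
import numpy as np

def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of most-likely tokens whose cumulative
    probability reaches top_p, then renormalize so they sum to 1."""
    probs = np.asarray(probs, dtype=np.float64)
    order = np.argsort(probs)[::-1]                       # most likely first
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1  # size of the nucleus
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()
```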
LLMs keep generating text until they emit an "End of Sequence" token, also called a "stop" token. You can also cut them off early by setting Max Tokens. It's always a good idea to set a reasonable limit on output tokens to prevent excessive token usage, especially with models capable of long outputs.
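For instance, assuming an OpenAI-compatible chat completions endpoint (the URL, model name, and API key below are placeholders), a request might cap output like this:

```python
import os
import requests

# Hypothetical OpenAI-compatible endpoint; substitute your provider's URL and model.
response = requests.post(
    "https://api.example.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['API_KEY']}"},
    json={
        "model": "example-model",
        "messages": [{"role": "user", "content": "Summarize the plot of Hamlet."}],
        "max_tokens": 200,    # stop after 200 output tokens
        "stop": ["\n\n"],     # also stop early at a custom stop sequence
    },
    timeout=30,
)
print(response.json()["choices"][0]["message"]["content"])
```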
Some providers allow you to generate multiple completions in a single request. This is only useful with a temperature greater than 0; at temperature 0, every completion would be identical.
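A sketch of requesting several completions at once, again assuming an OpenAI-compatible API where this parameter is called n:

```python
import os
import requests

# Same hypothetical endpoint as above; "n" asks for three independent completions.
response = requests.post(
    "https://api.example.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['API_KEY']}"},
    json={
        "model": "example-model",
        "messages": [{"role": "user", "content": "Suggest a name for a coffee shop."}],
        "n": 3,               # three completions in one request
        "temperature": 0.8,   # nonzero, so the completions actually differ
    },
    timeout=30,
)
for choice in response.json()["choices"]:
    print(choice["message"]["content"])
```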