Quantity & Quality of Data
In this section, we'll provide insight into how much data you need, and what it means to have a high-quality dataset.
The first step in answering these questions is to define the purpose of your dataset.
The Purpose of a Dataset
There are three primary use cases for a dataset in Entry Point:
Evaluating a prompt (see Templated Models)
Running Transforms to generate new data
Fine-tuning a model
Let's consider the data requirements for each of these use cases in turn.
When evaluating a prompt, the amount of data you need depends on the nature of your use case. More specifically, your data requirements are driven by the level of assurance you need before putting that prompt into production. For example, 25-100 examples might be sufficient for an internal use case where an occasional wrong output causes no real harm. Conversely, you might need thousands of validation examples to gain confidence in a customer-facing support agent, where a bad outcome could damage the reputation of your business or put it on the hook for promises made.
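To put a rough number on "level of assurance," you can borrow a standard sample-size calculation from statistics. The sketch below is plain Python, not an Entry Point feature; it uses the normal approximation to estimate how many evaluation examples you need to measure a pass rate within a given margin of error.

```python
import math

def eval_set_size(expected_pass_rate: float, margin: float, z: float = 1.96) -> int:
    """Rough sample size needed to estimate a pass rate within
    +/- `margin` at ~95% confidence (z = 1.96), using the normal
    approximation. A pass rate of 0.5 is the worst case and gives
    the largest n."""
    p = expected_pass_rate
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)

# Low-stakes internal tool: a wide +/-10% margin may be acceptable.
print(eval_set_size(0.5, 0.10))  # -> 97 examples

# Customer-facing agent: a tight +/-2% margin needs far more data.
print(eval_set_size(0.5, 0.02))  # -> 2401 examples
```

Note how the worst case (a 50% pass rate) with a tight margin lands in the thousands, which matches the intuition above.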
For our next use case, running Transforms to generate new data, there are no hard requirements on dataset size. Transforms work on any amount of data, so the number of examples in your dataset is entirely driven by what you're trying to accomplish.
This brings us to our third and final use case. Let's look at how much data you need to fine-tune a model.
How Much Data Do I Need for Fine-tuning?
As in our other use cases, the amount of data you need for fine-tuning a model depends on the specifics of your project, especially how complex it is. However, there are some principles we can apply to any scenario to help us determine an appropriate dataset size.
Generally speaking, we recommend starting with at least 25 examples in order to see your model's behavior change as a result of fine-tuning. If you have more training data available, then anywhere from 100-500 examples may be a better starting point.
If you are new to fine-tuning, start with a dataset size that won't overwhelm you, particularly if you need to manually re-label some data. We recommend starting with a manageable number of examples because, as we discuss in the next section, quality is key.
The general rule is that every time you double your data, you should expect a roughly constant improvement in output quality; in other words, quality scales with the logarithm of dataset size. If you increase your dataset from 25 examples to 50, you might see a 20% improvement in quality. Double it again from 50 to 100 examples and you could see another 20% improvement, and so on. The flip side is diminishing returns per example: each additional gain costs twice as many new examples as the last.
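To see what that heuristic implies, here is a toy projection in ordinary Python. The 20%-per-doubling figure is illustrative, not a measured constant.

```python
import math

# Toy projection of the doubling heuristic: if each doubling of the
# dataset adds a roughly constant amount of quality, then quality grows
# with the logarithm of dataset size.
baseline_size = 25        # examples in your starting dataset
gain_per_doubling = 0.20  # assumed relative improvement per doubling

for size in [25, 50, 100, 200, 400, 800]:
    doublings = math.log2(size / baseline_size)
    print(f"{size:>4} examples -> ~{doublings * gain_per_doubling:.0%} "
          f"improvement over the {baseline_size}-example baseline")
```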
Quality of Data
The best way to think about data is in terms of quality. That means your outputs need to be consistently excellent, even when your inputs are a mess. This teaches the model to do a good job and to do it the same way every time.
Ask yourself this question: if you were to teach a human how to do a task, what would you need to provide them? The goal is to make it as easy as possible for the AI model to learn. The relationship between the input data and the output should be clear and consistent across examples; otherwise, the model will have a hard time learning and will not converge on a specific behavior. Inconsistent or inaccurate outputs send mixed signals to the model and detract from its ability to learn the task.
Don't expect an AI model to learn special insights that a human wouldn't be able to detect.
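To make "clear and consistent" concrete, here is a minimal sketch of consistent examples versus mixed signals for a hypothetical support-ticket classification task. The field names and format are illustrative; your actual structure depends on the fields and templates you set up in Entry Point.

```python
# Consistent examples: messy inputs, but the outputs always follow the
# same clean format, so the model gets one clear signal to converge on.
# (Field names here are illustrative, not a required schema.)
good_examples = [
    {
        "input": "ordred 2 lamps, 1 arrived BROKEN!! want refund",
        "output": "Category: Refund Request\nSentiment: Negative",
    },
    {
        "input": "hey quick q -- do u ship to canada??",
        "output": "Category: Shipping Inquiry\nSentiment: Neutral",
    },
]

# Mixed signals: similar inputs, but the outputs disagree on format
# and level of detail, which makes the task harder to learn.
bad_examples = [
    {
        "input": "item never arrived, please advise",
        "output": "refund request",  # different casing, no sentiment
    },
    {
        "input": "package lost in transit :(",
        "output": "The customer seems upset about shipping.",  # freeform
    },
]
```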
Diversity of Inputs
In addition to output quality, diverse inputs are critical. Today's large language models contain billions of parameters. Ideally, when fine-tuning, you want to align as many of these parameters as possible to your use case. Using a wider variety of inputs for training helps ensure that more parameters get activated during your fine-tuning job.
Here are some ways to think about input data diversity:
Vary the length of inputs (use a mix of long, medium, and short inputs)
Mix various kinds of tokens, including words, characters, numbers, line breaks, formatting, and symbols
Include examples with spelling mistakes and bad grammar
Mix in different tones, styles, and emotions
Use adversarial prompts (prompts designed to mislead or trick the model)
Cover edge cases
Provide nonsensical inputs along with how you want the model to handle them (which could be a blank response)
More diverse inputs result in a more "robust" model that can handle anything you throw at it. Ultimately, this diversity is more important than the quantity of data.
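If you want a rough sense of how diverse your inputs actually are, a quick audit can surface gaps, such as having no short inputs or none containing numbers or symbols. The sketch below is illustrative Python, not an Entry Point feature, and its checks simply mirror the list above.

```python
from statistics import mean

def audit_inputs(inputs: list[str]) -> None:
    """Print a few crude diversity signals for a list of training inputs."""
    lengths = [len(text.split()) for text in inputs]
    print(f"{len(inputs)} inputs; words per input: "
          f"min={min(lengths)}, mean={mean(lengths):.1f}, max={max(lengths)}")
    print("with digits:     ", sum(any(c.isdigit() for c in t) for t in inputs))
    print("with line breaks:", sum("\n" in t for t in inputs))
    print("with symbols:    ", sum(any(not c.isalnum() and not c.isspace() for c in t)
                                   for t in inputs))

audit_inputs([
    "hello",
    "Order #1234 arrived damaged.\nPlease send a replacement!",
    "wat is ur returns policy???",
])
```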
For more information on data quality, see the paper Less is More for Alignment.