Evaluation

In order to be confident that you're making progress toward a better model, you need to test its outputs at each step of the prompt engineering or fine-tuning process. To help with this, Entry Point AI runs an evaluation (known as an "eval") each time you apply a template to a model or fine-tune one.

There are two key steps in an evaluation:

  1. Feed your validation examples through the model to get their outputs

  2. Score the outputs relative to your examples

Step 1 happens automatically. Step 2 can be handled in a few different ways, depending on your preferences.
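Conceptually, the process looks like the minimal sketch below. It is illustrative only: the `run_model` and `score_output` functions and the `prompt`/`completion` keys are placeholders, not Entry Point's API, and step 1 is something Entry Point performs for you.

```python
# Minimal sketch of the two-step evaluation loop (illustrative only;
# Entry Point runs step 1 automatically on your behalf).

def run_model(prompt: str) -> str:
    """Placeholder for calling the model under evaluation."""
    raise NotImplementedError

def score_output(output: str, expected: str) -> float:
    """Placeholder for whichever scoring method you choose (see below)."""
    raise NotImplementedError

def evaluate(validation_examples: list[dict]) -> float:
    scores = []
    for example in validation_examples:
        output = run_model(example["prompt"])                         # Step 1: get outputs
        scores.append(score_output(output, example["completion"]))   # Step 2: score them
    return sum(scores) / len(scores)
```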

Entry Point provides three scoring methods you can select from.

You can select your desired scoring method by navigating to Settings and choosing it from the dropdown under the Evaluations section.

Exact Match

Exact match is a simple comparison to see whether the model's output matches the example's output. This scoring method is best suited for classifiers with objective outputs that can each be considered "right" or "wrong."

For example, if you are fine-tuning a model to accurately classify incoming support requests, exact match scoring would be a good choice. This method would allow you to score the outputs based on whether or not the model correctly identified the category.
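A rough sketch of the idea is below. The whitespace trimming is an assumption about how lenient the comparison might be; Entry Point's actual comparison may be stricter or more forgiving.

```python
# Illustrative exact-match scorer: an output counts as correct only if it
# matches the example's expected completion exactly (after trimming whitespace).

def exact_match(output: str, expected: str) -> float:
    return 1.0 if output.strip() == expected.strip() else 0.0

# e.g. classifying incoming support requests into categories
print(exact_match("Billing", "Billing"))    # 1.0 -- correct category
print(exact_match("Billing", "Technical"))  # 0.0 -- wrong category
```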

Manual

The manual scoring method allows humans to review and score each output. While it is the most time-consuming method, it can be a good choice for highly subjective evaluations.

For example, if you are fine-tuning a model to write marketing copy in your specific brand voice, manual evaluation may be the best choice because it allows you to rank outputs based on your own intuition.

Predictive

Predictive scoring (often referred to as "LLM-as-a-judge") leverages a large language model of your choice to score outputs. Using an LLM for this purpose allows you to approximate the human intuition of manual scoring in far less time.

When using this method, the model you have selected as your judge will display both its score and its reasoning for arriving at that score.

Entry Point provides several ways to customize predictive scoring:

  • Model - select the platform and model to use for scoring

  • Criteria - choose which aspects of the output the score should be based upon

  • Additional Instructions - write your own advanced instructions for the model to consider when scoring
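To make the pattern concrete, here is a rough sketch of the general LLM-as-a-judge approach using the OpenAI Python client. The judge model name, criteria, prompt wording, and JSON response format are illustrative assumptions for this sketch, not Entry Point's internal implementation.

```python
# Rough sketch of the LLM-as-a-judge pattern. The judge model, criteria, and
# response format below are assumptions for illustration, not Entry Point's
# actual implementation.
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading a model's output against a reference answer.
Criteria: accuracy, tone, formatting.
Additional instructions: {instructions}

Reference answer:
{expected}

Model output:
{output}

Respond with JSON: {{"score": <1-10>, "reasoning": "<one short paragraph>"}}"""

def judge(output: str, expected: str, instructions: str = "") -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",  # the "Model" option: which LLM acts as the judge
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            instructions=instructions, expected=expected, output=output)}],
    )
    # Assumes the judge returns valid JSON; returns {"score": ..., "reasoning": ...}
    return json.loads(response.choices[0].message.content)
```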

Predictive scoring on Entry Point also has a unique capability: it can flag issues with your example data.

The underlying assumption behind evals is that you have good validation examples with which to test your model. However, this is not always true; every dataset has room for improvement.

The evaluation process can help you uncover and resolve larger issues with your dataset, leading to far better results overall.
