Evaluating Model Performance

The second step in evaluating the performance of an LLM is to choose a suitable evaluation method that either scores the model responses directly or compares them against the reference answers to obtain the final scores.

ml3m provides the functionality to score model responses and store the scores in a csv file. The relevant classes are:

base.BaseEvaluator: Base evaluator class.
base.BaseOpenAIEvaluator: Base evaluator class via OpenAI.
mcq.McqOpenAIEvaluator: Evaluator for multiple-choice questions via OpenAI.
qa.QaMetricEvaluator: Evaluator for question-answering via common metrics.
qa.QaOpenAIEvaluator: Evaluator for question-answering via OpenAI.

The evaluators in the ml3m.base module are intended to be subclassed and cannot be used directly, so we will not discuss them for now. Nevertheless, all evaluators follow a similar API. To use them, you will need to prepare an evaluation dataset dataset with model responses and a saving location save_path for storing the scores. Here, dataset corresponds exactly to the dataset generated by a response generator, and the accepted formats are also the same, as described here.
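For concreteness in the examples that follow, assume each data item in dataset looks roughly like the dictionary below. The key names are purely illustrative assumptions; the actually accepted dataset formats are the ones described in the linked documentation.

```python
# A hypothetical data item produced in advance by a response generator.
# The key names ("question", "response", "answer") are assumptions for
# illustration only; any structure works as long as your info_func
# knows how to extract the relevant fields from it.
data_item = {
    "question": "What is the capital of France?",
    "response": "The capital of France is Paris.",  # model response
    "answer": "Paris",  # reference answer
}
```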

Defining the Information Function

All evaluators provided by ml3m require a parameter info_func, which should be a function that accepts a data item and returns the question/query, the model response, and the reference answer (if the evaluator requires one). For details on exactly what to return, refer to the API documentation.

All of this information should be directly extractable from a data item, which is why it is recommended to generate the model responses in advance with a response generator rather than generating answers while scoring them. A sketch of such a function is given below.
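As a minimal sketch under the assumptions above, an info_func for a question-answering evaluator can simply pick the relevant fields out of the data item shown earlier. The import path mirrors the module names listed in this section, and dataset, save_path, and info_func are the parameters named here; the evaluate() call, the tuple return format, and any remaining constructor arguments are assumptions that should be verified against the API documentation.

```python
from ml3m.qa import QaOpenAIEvaluator


def info_func(data_item: dict) -> tuple[str, str, str]:
    """Extract the question, model response, and reference answer.

    The key names follow the hypothetical data item shown earlier; adapt
    them to however your dataset is actually structured, and check the
    API documentation for the exact return format each evaluator expects.
    """
    return data_item["question"], data_item["response"], data_item["answer"]


evaluator = QaOpenAIEvaluator(
    dataset="dataset.json",  # evaluation dataset with model responses
    save_path="scores.csv",  # csv file where the scores will be stored
    info_func=info_func,
    # OpenAI-based evaluators additionally need OpenAI configuration;
    # see the API documentation for the exact parameter.
)
evaluator.evaluate()  # assumed entry point; verify against the API documentation
```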