Evaluating Model Performance
============================

The second step in evaluating the performance of an LLM is to choose a suitable
evaluation method, either to score the model responses directly or to compare
the model responses with the reference answers to obtain final scores. ml3m
provides the functionality to score model responses and store the scores in a
csv file. The relevant classes are:

.. currentmodule:: ml3m

.. autosummary::
   :nosignatures:

   base.BaseEvaluator
   base.BaseOpenAIEvaluator
   mcq.McqOpenAIEvaluator
   qa.QaMetricEvaluator
   qa.QaOpenAIEvaluator

The evaluators in the :mod:`ml3m.base` module are intended to be subclassed and
cannot be used directly, so we will not discuss them for now. Nevertheless, all
evaluators follow a similar API. To use them, you will need to prepare an
evaluation dataset ``dataset`` with model responses and a saving location
``save_path`` for storing the scores. Here, ``dataset`` corresponds exactly to
the one generated by a :ref:`response generator`. The accepted formats are also
the same, as described :ref:`here`.

Defining the Information Function
---------------------------------

All evaluators provided by ml3m require a parameter ``info_func``, which should
be a function that accepts a data item and returns the question/query, the
model response, and the reference answer (if required by the evaluator). For
details on what to return, refer to the API documentation. All of this
information should be directly *extractable* from a data item, meaning that it
is recommended to use a :ref:`response generator` to generate the model
responses in advance, instead of generating answers *while* scoring them.
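
For illustration, the following is a minimal sketch of an ``info_func``,
assuming (hypothetically) that each data item is a dictionary with the keys
``"instruction"``, ``"response"``, and ``"answer"``; adapt the keys to the
actual format of your dataset and to what the chosen evaluator expects:

.. code-block:: python

    def info_func(data_item: dict) -> tuple[str, str, str]:
        """Extract the question, model response, and reference answer.

        Assumes a (hypothetical) data item of the form
        {"instruction": ..., "response": ..., "answer": ...}.
        """
        question = data_item["instruction"]  # the question/query
        actual = data_item["response"]       # the model response
        reference = data_item["answer"]      # the reference answer
        return question, actual, reference

Such a function can then be passed to an evaluator together with ``dataset``
and ``save_path``, along with any evaluator-specific parameters described in
the API documentation.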