Evaluating Model Performance
============================

The second step in evaluating the performance of an LLM is to choose a suitable
evaluation method, either to score the model responses directly or to compare
the model responses with the reference answers to obtain final scores. ml3m
provides the functionality to score model responses and store the scores in a
csv file. The relevant classes are:

.. currentmodule:: ml3m

.. autosummary::
   :nosignatures:

   base.BaseEvaluator
   base.BaseOpenAIEvaluator
   mcq.McqOpenAIEvaluator
   qa.QaMetricEvaluator
   qa.QaOpenAIEvaluator

The evaluators in the :mod:`ml3m.base` module are intended to be subclassed and
cannot be used directly, so we will not discuss them for now. Nevertheless, all
evaluators follow a similar API. To use them, you will need to prepare an
evaluation dataset ``dataset`` with model responses and a saving location
``save_path`` for storing the scores. Here, ``dataset`` corresponds exactly to
the one generated by a :ref:`response generator`. The accepted formats are also
the same, as described :ref:`here`.

Defining the Information Function
---------------------------------

All evaluators provided by ml3m require a parameter ``info_func``, which should
be a function that accepts a data item and returns the question/query, the
model response, and the reference answer (if required by the evaluator). For
details on what to return, refer to the API documentation. All of this
information should be directly *extractable* from a data item, meaning that it
is recommended to use a :ref:`response generator` to generate the model
responses in advance, instead of generating answers *while* scoring them.
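
For illustration, the following is a minimal sketch of an ``info_func``,
assuming (hypothetically) that each data item is a dictionary with the keys
``"instruction"``, ``"response"``, and ``"answer"``; adapt the keys to the
actual format of your dataset and to what the chosen evaluator expects:

.. code-block:: python

    def info_func(data_item: dict) -> tuple[str, str, str]:
        """Extract the question, model response, and reference answer.

        Assumes a (hypothetical) data item of the form
        {"instruction": ..., "response": ..., "answer": ...}.
        """
        question = data_item["instruction"]  # the question/query
        actual = data_item["response"]       # the model response
        reference = data_item["answer"]      # the reference answer
        return question, actual, reference

Such a function can then be passed to an evaluator together with ``dataset``
and ``save_path``, along with any evaluator-specific parameters described in
the API documentation.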