ml3m.mcq: Multiple Choice Questions

class ml3m.mcq.McqOpenAIEvaluator(dataset: str | Path, save_path: str | Path, openai_config: str | Path, info_func: Callable[[DataItemType], tuple[str, str, str]], *, fmt: DatasetFormat = 'jsonl', score_name: str = 'score', label_type: Literal['upper', 'lower', 'digit'] = 'upper', label_cnt: int = 4, setting: str | None = None, n_iter: int = 1, timeout: float = 60, model: str = 'gpt-3.5-turbo', logging_mode: LoggingMode = 'all', verbose: int = 0)[source]

Bases: BaseOpenAIEvaluator

Evaluator for multiple-choice questions via OpenAI.

This evaluator utilizes the ability of OpenAI models to tell whether a response selects the correct options, based on the reference answer. The score for each data item is either 0 or 100; there is no partial credit.

Parameters

dataset : str or pathlib.Path

The absolute path to the evaluation dataset.

save_path : str or pathlib.Path

The absolute path to the save location. This path may or may not exist, and if it exists, its file contents will be treated as a (partially) written result. Whether to overwrite the existing results or to build on them depends on overwrite when using the McqOpenAIEvaluator.evaluate() method.

openai_config : str or pathlib.Path

The absolute path to the OpenAI configuration file.

info_func : Callable

The function that extracts the question, actual answer, and expected answer of a data item. The input parameter should be a pandas.Series, a list, or a dictionary, depending on fmt and the specific type of each data item. The output should be a tuple of three strings: the question, the actual answer to that question, and the expected answer to that question, in that order. See the Notes section for examples.

fmt : {“jsonl”, “json”, “csv”}, default=”jsonl”

The format of dataset.

score_name : str, default=”score”

The key/column name to use for the obtained score. This should not be a key or column name that already exists in the save location. Be extremely careful: no warning or exception will be raised if it does.

label_type : {“upper”, “lower”, “digit”}, default=”upper”

The type of the option labels. “upper” stands for A, B, C, D, … “lower” stands for a, b, c, d, … “digit” stands for 1, 2, 3, 4, …

label_cnt : int, default=4

The number of options. For instance, label_type="upper" with label_cnt=4 means that the option labels are A, B, C, and D (see the illustrative sketch after this parameter list).

setting : str, optional

The personality setting for the OpenAI model, passed as the system message. If None, then no system message is used.

n_iter : int, default=1

The number of iterations for each data item. The mode of the scores across iterations is taken as the final score of that data item.

timeout : float, default=60

The timeout in seconds. This is not the OpenAI timeout, but the timeout for cancelling the worker tasks.

model : str, default=”gpt-3.5-turbo”

The ID of the model to use; this must be one of the available OpenAI models that support the ChatCompletion API. See also https://platform.openai.com/docs/models/model-endpoint-compatibility

logging_mode : {“all”, “failed”, “none”}, default=”all”

The logging mode: whether to save the logs of all items, only of failed items, or no logs at all.

verbose : int, default=0

The verbosity level of the processing. For negative levels, only a progress bar is displayed. For level 0, errored items are also displayed. For positive levels, all items are displayed, and the verbosity level determines the number of lines to display for the message of each item.
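
For reference, the sketch below shows how label_type and label_cnt could combine into concrete option labels. The make_labels helper is purely illustrative and is not part of ml3m.

import string

def make_labels(label_type: str, label_cnt: int) -> list[str]:
    # Illustrative helper (not part of ml3m): pick the first label_cnt
    # labels from the pool corresponding to label_type.
    pools = {
        "upper": string.ascii_uppercase,  # A, B, C, ...
        "lower": string.ascii_lowercase,  # a, b, c, ...
        "digit": "123456789",             # 1, 2, 3, ... (up to 9 in this sketch)
    }
    return list(pools[label_type][:label_cnt])

print(make_labels("upper", 4))  # ['A', 'B', 'C', 'D']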

Notes

Here are some examples of info_func:

Assume that dataset is in .jsonl format and each line is of the following form: {"instruction": "xxx", "input": "xxx", "output": "xxx", "history": [], "response": "xxx"}. Then info_func can be defined as follows:

def info_func(data_item: dict) -> tuple[str, str, str]:
    # Concatenate the instruction and the input to form the full question
    question = data_item["instruction"] + "\n" + data_item["input"]
    actual = data_item["response"]  # the model's answer to evaluate
    expected = data_item["output"]  # the reference answer
    return question, actual, expected

Now assume that dataset is in .csv format with columns “question”, “A”, “B”, “C”, “D”, “answer”, and “response”. Then info_func can be defined as follows:

import pandas

def info_func(data_item: pandas.Series) -> tuple[str, str, str]:
    # Unpack the relevant columns of the row
    question, A, B, C, D, answer, response = data_item[
        ["question", "A", "B", "C", "D", "answer", "response"]
    ]
    # Present the question together with its four labeled options
    formatted_question = f"{question}\nA. {A}\nB. {B}\nC. {C}\nD. {D}"
    return formatted_question, response, answer
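
Putting it all together, a minimal usage sketch might look as follows. The file paths are placeholders, and the keyword arguments shown are only a subset of those documented above; adapt everything to your own dataset.

from ml3m.mcq import McqOpenAIEvaluator

def info_func(data_item: dict) -> tuple[str, str, str]:
    # Same extraction logic as the .jsonl example above
    question = data_item["instruction"] + "\n" + data_item["input"]
    return question, data_item["response"], data_item["output"]

evaluator = McqOpenAIEvaluator(
    dataset="/abs/path/to/dataset.jsonl",      # placeholder path
    save_path="/abs/path/to/scores.csv",       # placeholder path
    openai_config="/abs/path/to/openai.json",  # placeholder path
    info_func=info_func,
    fmt="jsonl",
    label_type="upper",
    label_cnt=4,
)
completed = evaluator.evaluate()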

evaluate(*, overwrite: bool = False) → bool

Evaluate the specified dataset.

Parameters

overwrite : bool, default=False

Whether to overwrite the data in save_path. If False, the evaluation will build upon the existing data in save_path; otherwise, all data will be evaluated and the existing data will be overwritten.

Returns

completed : bool

Whether the task has been completed.
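
Because evaluate() reports completion through its return value, one plausible pattern (a sketch, not something prescribed by ml3m, assuming a False return means some items failed) is to retry until every item has been scored, relying on the default overwrite=False to build on partial results from earlier runs:

# `evaluator` constructed as in the Notes example above.
# Retry up to a few times; with overwrite=False, each run only
# evaluates the items still missing from save_path.
for attempt in range(3):
    if evaluator.evaluate(overwrite=False):
        break  # all items scored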

load_avg_score(subject_subset: list | None = None, items: list | None = None) → dict[str, numbers.Real]

Load the average score of each subject from the save location.

Parameters

subject_subset : list or None

The subjects of the scores to select, i.e., the columns. If None, select all subjects.

items : list or None

The indices of the items to select. If None, select all items. This will be applied after subject_subset. This does not necessarily need to be a subset of the index of the loaded pd.DataFrame; however, any item out of range will not be taken into account when computing the average score.

Returns

avg_score : dict[str, numbers.Real]

The average score of each subject, loaded from save_path and keyed by subject name.

Examples

Suppose that the file at save_path looks like the following:

i,score1,score2
0,78,83
1,64,76
2,100,92
3,28,38
4,30,45
>>> evaluator.load_avg_score()  
{'score1': 60.0, 'score2': 66.8}
>>> evaluator.load_avg_score(subject_subset=["score2"])  
{'score2': 66.8}
>>> evaluator.load_avg_score(items=list(range(7)))  
{'score1': 60.0, 'score2': 66.8}
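
The averaging semantics above can be reproduced with plain pandas, which may clarify how out-of-range items are handled. This is a sketch of the observed behavior, not ml3m's actual implementation; "scores.csv" stands in for the file at save_path.

import pandas as pd

df = pd.read_csv("scores.csv", index_col="i")  # the file at save_path
items = list(range(7))
# Indices absent from the file (5 and 6 here) are dropped before
# averaging, so they do not dilute the mean.
selected = df.loc[df.index.intersection(items)]
print(selected.mean().to_dict())  # {'score1': 60.0, 'score2': 66.8}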

load_scores(subject_subset: list | None = None, items: list | None = None) → DataFrame

Load the scores from the save location.

Parameters

subject_subset : list or None

The subjects of the scores to select, i.e., the columns. If None, select all subjects. In the returned pd.DataFrame, the columns will be in the same order as subject_subset.

items : list or None

The indices of the items to select. If None, select all items. This will be applied after subject_subset. This does not necessarily need to be a subset of the index of the loaded pd.DataFrame. The indices that do not exist in the index of the loaded pd.DataFrame will be assigned NaN. In the returned pd.DataFrame, the rows will be in the same order as items.

Returns

scores : pandas.DataFrame

The scores loaded from save_path.

Examples

Suppose that the file at save_path looks like the following:

i,score1,score2
0,78,83
1,64,76
3,100,92
4,28,38
5,30,45
>>> evaluator.load_scores()  
   score1  score2
i
0      78      83
1      64      76
3     100      92
4      28      38
5      30      45
>>> evaluator.load_scores(subject_subset=["score2"])  
   score2
i
0      83
1      76
3      92
4      38
5      45
>>> evaluator.load_scores(items=list(range(7)))  
   score1  score2
i
0    78.0    83.0
1    64.0    76.0
2     NaN     NaN
3   100.0    92.0
4    28.0    38.0
5    30.0    45.0
6     NaN     NaN
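
The items behavior shown in the last example matches a pandas reindex: indices missing from the file appear as NaN rows, in the requested order. A sketch of the equivalent operation, for illustration only, with "scores.csv" again standing in for the file at save_path:

import pandas as pd

df = pd.read_csv("scores.csv", index_col="i")  # the file at save_path
# Reindexing inserts NaN rows for the indices absent from the file
# (2 and 6 here), which also promotes the integer columns to float.
print(df.reindex(range(7)))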