ml3m.mcq: Multiple Choice Questions

class ml3m.mcq.McqOpenAIEvaluator(dataset: str | Path, save_path: str | Path, openai_config: str | Path, info_func: Callable[[DataItemType], tuple[str, str, str]], *, fmt: DatasetFormat = 'jsonl', score_name: str = 'score', label_type: Literal['upper', 'lower', 'digit'] = 'upper', label_cnt: int = 4, setting: str | None = None, n_iter: int = 1, timeout: float = 60, model: str = 'gpt-3.5-turbo', logging_mode: LoggingMode = 'all', verbose: int = 0)[source]

Bases: BaseOpenAIEvaluator

Evaluator for multiple-choice questions via OpenAI.

This evaluator utilizes the ability of OpenAI models to tell whether a response selects the correct options, based on the reference answer. The score for each data item is either 0 or 100; there is no partial credit.

Parameters

dataset : str or pathlib.Path

The absolute path to the evaluation dataset.

save_path : str or pathlib.Path

The absolute path to the save location. This path may or may not exist, and if it exists, its file contents will be treated as a (partially) written result. Whether to overwrite the existing results or to build on them depends on overwrite when using the McqOpenAIEvaluator.evaluate() method.

openai_config : str or pathlib.Path

The absolute path to the OpenAI configuration file.

info_func : Callable

The function that extracts the question, actual answer, and expected answer of a data item. The input parameter should be a pandas.Series, a list, or a dictionary, depending on fmt and the specific type of each data item. The output should be a tuple of three strings: the question, the actual answer to that question, and the expected answer to that question, in that order. See the Notes section for examples.

fmt : {“jsonl”, “json”, “csv”}, default=”jsonl”

The format of dataset.

score_name : str, default=”score”

The key/column name to use for the obtained score. This should not be a key or column name that already exists in the save location. Be extremely careful: no warning or exception will be raised if it does.

label_type : {“upper”, “lower”, “digit”}, default=”upper”

The type of the option labels. “upper” stands for A, B, C, D, … “lower” stands for a, b, c, d, … “digit” stands for 1, 2, 3, 4, …

label_cnt : int, default=4

The number of options. For instance, label_type="upper" with label_cnt=4 means that the option labels are A, B, C, and D (see the illustrative sketch after this parameter list).

setting : str, optional

The personality setting for the OpenAI model, passed as the system message. If None, then no system message is used.

n_iter : int, default=1

The number of iterations for each data item. The mode of the scores across iterations is taken as the final score of that data item.

timeout : float, default=60

The timeout in seconds. This is not the OpenAI timeout, but the timeout for cancelling the worker tasks.

model : str, default=”gpt-3.5-turbo”

The ID of the model to use; this must be one of the available OpenAI models that support the ChatCompletion API. See also https://platform.openai.com/docs/models/model-endpoint-compatibility

logging_mode : {“all”, “failed”, “none”}, default=”all”

The logging mode: whether to save the logs of all items, only of failed items, or no logs at all.

verbose : int, default=0

The verbosity level of the processing. For negative levels, only a progress bar is displayed. For level 0, errored items are also displayed. For positive levels, all items are displayed, and the verbosity level determines the number of lines to display for the message of each item.
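
For reference, the sketch below shows how label_type and label_cnt could combine into concrete option labels. The make_labels helper is purely illustrative and is not part of ml3m.

import string

def make_labels(label_type: str, label_cnt: int) -> list[str]:
    # Illustrative helper (not part of ml3m): pick the first label_cnt
    # labels from the pool corresponding to label_type.
    pools = {
        "upper": string.ascii_uppercase,  # A, B, C, ...
        "lower": string.ascii_lowercase,  # a, b, c, ...
        "digit": "123456789",             # 1, 2, 3, ... (up to 9 in this sketch)
    }
    return list(pools[label_type][:label_cnt])

print(make_labels("upper", 4))  # ['A', 'B', 'C', 'D']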

Notes

Here are some examples of info_func:

Assume that dataset is in .jsonl format and each line is of the following form: {"instruction": "xxx", "input": "xxx", "output": "xxx", "history": [], "response": "xxx"}. Then info_func can be defined as follows:

def info_func(data_item: dict) -> tuple[str, str, str]:
    # Concatenate the instruction and the input to form the full question
    question = data_item["instruction"] + "\n" + data_item["input"]
    actual = data_item["response"]  # the model's answer to evaluate
    expected = data_item["output"]  # the reference answer
    return question, actual, expected

Now assume that dataset is in .csv format with columns “question”, “A”, “B”, “C”, “D”, “answer”, and “response”. Then info_func can be defined as follows:

import pandas

def info_func(data_item: pandas.Series) -> tuple[str, str, str]:
    # Unpack the relevant columns of the row
    question, A, B, C, D, answer, response = data_item[
        ["question", "A", "B", "C", "D", "answer", "response"]
    ]
    # Present the question together with its four labeled options
    formatted_question = f"{question}\nA. {A}\nB. {B}\nC. {C}\nD. {D}"
    return formatted_question, response, answer
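
Putting it all together, a minimal usage sketch might look as follows. The file paths are placeholders, and the keyword arguments shown are only a subset of those documented above; adapt everything to your own dataset.

from ml3m.mcq import McqOpenAIEvaluator

def info_func(data_item: dict) -> tuple[str, str, str]:
    # Same extraction logic as the .jsonl example above
    question = data_item["instruction"] + "\n" + data_item["input"]
    return question, data_item["response"], data_item["output"]

evaluator = McqOpenAIEvaluator(
    dataset="/abs/path/to/dataset.jsonl",      # placeholder path
    save_path="/abs/path/to/scores.csv",       # placeholder path
    openai_config="/abs/path/to/openai.json",  # placeholder path
    info_func=info_func,
    fmt="jsonl",
    label_type="upper",
    label_cnt=4,
)
completed = evaluator.evaluate()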

evaluate(*, overwrite: bool = False) → bool

Evaluate the specified dataset.

Parameters

overwrite : bool, default=False

Whether to overwrite the data in save_path. If False, the evaluation will build upon the existing data in save_path; otherwise, all data will be evaluated and the existing data will be overwritten.

Returns

completed : bool

Whether the task has been completed.
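
Because evaluate() reports completion through its return value, one plausible pattern (a sketch, not something prescribed by ml3m, assuming a False return means some items failed) is to retry until every item has been scored, relying on the default overwrite=False to build on partial results from earlier runs:

# `evaluator` constructed as in the Notes example above.
# Retry up to a few times; with overwrite=False, each run only
# evaluates the items still missing from save_path.
for attempt in range(3):
    if evaluator.evaluate(overwrite=False):
        break  # all items scored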

load_avg_score(subject_subset: list | None = None, items: list | None = None) → dict[str, numbers.Real]

Load the average score of each subject from the save location.

Parameters

subject_subset : list or None

The subjects of the scores to select, i.e., the columns. If None, select all subjects.

items : list or None

The indices of the items to select. If None, select all items. This will be applied after subject_subset. This does not necessarily need to be a subset of the index of the loaded pd.DataFrame; however, any item out of range will not be taken into account when computing the average score.

Returns

avg_score : dict[str, numbers.Real]

The average score of each subject, loaded from save_path and keyed by subject name.

Examples

Suppose that the file at save_path looks like the following:

i,score1,score2
0,78,83
1,64,76
2,100,92
3,28,38
4,30,45
>>> evaluator.load_avg_score()  
{'score1': 60.0, 'score2': 66.8}
>>> evaluator.load_avg_score(subject_subset=["score2"])  
{'score2': 66.8}
>>> evaluator.load_avg_score(items=list(range(7)))  
{'score1': 60.0, 'score2': 66.8}
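
The averaging semantics above can be reproduced with plain pandas, which may clarify how out-of-range items are handled. This is a sketch of the observed behavior, not ml3m's actual implementation; "scores.csv" stands in for the file at save_path.

import pandas as pd

df = pd.read_csv("scores.csv", index_col="i")  # the file at save_path
items = list(range(7))
# Indices absent from the file (5 and 6 here) are dropped before
# averaging, so they do not dilute the mean.
selected = df.loc[df.index.intersection(items)]
print(selected.mean().to_dict())  # {'score1': 60.0, 'score2': 66.8}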

load_scores(subject_subset: list | None = None, items: list | None = None) → DataFrame

Load the scores from the save location.

Parameters

subject_subset : list or None

The subjects of the scores to select, i.e., the columns. If None, select all subjects. In the returned pd.DataFrame, the columns will be in the same order as subject_subset.

items : list or None

The indices of the items to select. If None, select all items. This will be applied after subject_subset. This does not necessarily need to be a subset of the index of the loaded pd.DataFrame. The indices that do not exist in the index of the loaded pd.DataFrame will be assigned NaN. In the returned pd.DataFrame, the rows will be in the same order as items.

Returns

scores : pandas.DataFrame

The scores loaded from save_path.

Examples

Suppose that the file at save_path looks like the following:

i,score1,score2
0,78,83
1,64,76
3,100,92
4,28,38
5,30,45
>>> evaluator.load_scores()  
   score1  score2
i
0      78      83
1      64      76
3     100      92
4      28      38
5      30      45
>>> evaluator.load_scores(subject_subset=["score2"])  
   score2
i
0      83
1      76
3      92
4      38
5      45
>>> evaluator.load_scores(items=list(range(7)))  
   score1  score2
i
0    78.0    83.0
1    64.0    76.0
2     NaN     NaN
3   100.0    92.0
4    28.0    38.0
5    30.0    45.0
6     NaN     NaN
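
The items behavior shown in the last example matches a pandas reindex: indices missing from the file appear as NaN rows, in the requested order. A sketch of the equivalent operation, for illustration only, with "scores.csv" again standing in for the file at save_path:

import pandas as pd

df = pd.read_csv("scores.csv", index_col="i")  # the file at save_path
# Reindexing inserts NaN rows for the indices absent from the file
# (2 and 6 here), which also promotes the integer columns to float.
print(df.reindex(range(7)))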