ml3m.mcq: Multiple Choice Questions
- class ml3m.mcq.McqOpenAIEvaluator(dataset: str | Path, save_path: str | Path, openai_config: str | Path, info_func: Callable[[DataItemType], tuple[str, str, str]], *, fmt: DatasetFormat = 'jsonl', score_name: str = 'score', label_type: Literal['upper', 'lower', 'digit'] = 'upper', label_cnt: int = 4, setting: str | None = None, n_iter: int = 1, timeout: float = 60, model: str = 'gpt-3.5-turbo', logging_mode: LoggingMode = 'all', verbose: int = 0)[source]
Bases: BaseOpenAIEvaluator
Evaluator for multiple-choice questions via OpenAI.
This evaluator utilizes the ability of OpenAI models to tell whether a response selects the correct options, based on the reference answer. The score for each data item is either 0 or 100; no partial credit is given.
Parameters
- dataset : str or pathlib.Path
The absolute path to the evaluation dataset.
- save_path : str or pathlib.Path
The absolute path to the save location. This path may or may not exist; if it exists, its file contents will be treated as a (partially) written result. Whether to overwrite the existing results or to build on them depends on overwrite when using the McqOpenAIEvaluator.evaluate() method.
- openai_config : str or pathlib.Path
The absolute path to the OpenAI configuration file.
- info_func : Callable
The function that extracts the question, actual answer, and expected answer of a data item. The input parameter should be a pandas.Series, a list, or a dictionary, depending on fmt and the specific type of each data item. The output should be a tuple of three strings: the question, the actual answer to that question, and the expected answer to that question, in that order. See the notes for examples.
- fmt : {"jsonl", "json", "csv"}, default="jsonl"
The format of dataset.
- score_name : str, default="score"
The key/column name to use for the obtained score. This should not be a key or column name that already exists in the save location. Be extremely careful since there will be no warning or exception raised on this.
- label_type : {"upper", "lower", "digit"}, default="upper"
The type of the option labels. “upper” stands for A, B, C, D, … “lower” stands for a, b, c, d, … “digit” stands for 1, 2, 3, 4, …
- label_cnt : int, default=4
The number of options. For instance, label_type="upper" with label_cnt=4 means that the option labels are A, B, C, and D (a short sketch of this mapping is given after this parameter list).
- setting : str, optional
The personality setting for the OpenAI model, passed as the system message. If None, then no system message is used.
- n_iter : int, default=1
The number of iterations for each data item. The mode of the scores for each data item will be taken as the final score.
- timeout : float, default=60
The timeout in seconds. This is not the OpenAI timeout, but the timeout for cancelling the worker tasks.
- model : str, default="gpt-3.5-turbo"
The ID of the model to use; it must be one of the available OpenAI models that support the ChatCompletion API. See also https://platform.openai.com/docs/models/model-endpoint-compatibility
- logging_mode : {"all", "failed", "none"}, default="all"
The logging mode: whether to save the logs of all items, only of failed items, or no logs at all.
- verbose : int, default=0
The verbosity level of the processing. For negative levels, only a progress bar is displayed. For level 0, errored items are also displayed. For positive levels, all items are displayed, and the verbosity level determines the number of lines to display for the message of each item.
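The following is a minimal sketch of the option labels implied by label_type and label_cnt; the helper name option_labels is purely illustrative and not part of ml3m:

import string

def option_labels(label_type: str, label_cnt: int) -> list[str]:
    # Illustrative helper only; not ml3m's own code.
    if label_type == "upper":
        return list(string.ascii_uppercase[:label_cnt])
    if label_type == "lower":
        return list(string.ascii_lowercase[:label_cnt])
    return [str(i) for i in range(1, label_cnt + 1)]  # label_type == "digit"

print(option_labels("upper", 4))  # ['A', 'B', 'C', 'D']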
Notes
Here are some examples of info_func:
Assume that dataset is in .jsonl format and each line is of the following form: {"instruction": "xxx", "input": "xxx", "output": "xxx", "history": [], "response": "xxx"}. Then info_func can be defined as follows:

def info_func(data_item: dict) -> tuple[str, str, str]:
    question = data_item["instruction"] + "\n" + data_item["input"]
    actual = data_item["response"]
    expected = data_item["output"]
    return question, actual, expected
Now assume that dataset is in .csv format with columns "question", "A", "B", "C", "D", "answer", and "response". Then info_func can be defined as follows:

def info_func(data_item: pandas.Series) -> tuple[str, str, str]:
    question, A, B, C, D, answer, response = data_item[
        ["question", "A", "B", "C", "D", "answer", "response"]
    ]
    formatted_question = f"{question}\nA. {A}\nB. {B}\nC. {C}\nD. {D}"
    return formatted_question, response, answer
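Putting the pieces together, here is a minimal end-to-end sketch (not an official recipe) showing how the .csv example above might be evaluated; all file paths and the n_iter value are assumptions chosen for illustration:

from ml3m.mcq import McqOpenAIEvaluator

evaluator = McqOpenAIEvaluator(
    dataset="/abs/path/to/dataset.csv",          # assumed path
    save_path="/abs/path/to/results.csv",        # assumed path
    openai_config="/abs/path/to/openai_config",  # assumed path to the OpenAI configuration file
    info_func=info_func,                         # the csv-based info_func defined above
    fmt="csv",
    label_type="upper",
    label_cnt=4,
    n_iter=3,  # take the mode of three scores per item
)

completed = evaluator.evaluate()
if completed:
    print(evaluator.load_avg_score())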
- evaluate(*, overwrite: bool = False) → bool
Evaluate the specified dataset.
Parameters
- overwrite : bool, default=False
Whether to overwrite the data in save_path. If False, the evaluation will build upon the existing data in save_path; otherwise all data will be re-evaluated and the existing data will be overwritten.
Returns
- completed : bool
Whether the task has been completed.
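If some items fail (for instance due to timeouts), the method returns False. A possible retry pattern is sketched below; it assumes, based on the description of overwrite above, that a repeated call with overwrite=False only works on items not yet written to save_path:

import time

for attempt in range(3):
    if evaluator.evaluate(overwrite=False):
        break  # all items evaluated
    time.sleep(10)  # arbitrary pause before retrying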
- load_avg_score(subject_subset: list | None = None, items: list | None = None) → dict[str, numbers.Real]
Load the average score of each subject from the save location.
Parameters
- subject_subset : list or None
The subjects of the scores to select, i.e., the columns. If None, select all subjects.
- items : list or None
The indices of the items to select. If None, select all items. This will be applied after subject_subset. This does not necessarily need to be a subset of the index of the loaded pd.DataFrame; however, any item out of range would not be taken into account when computing the average score.
Returns
- avg_score : dict[str, numbers.Real]
The average score of each subject, loaded from save_path.
Examples
Suppose that the file at save_path looks like the following:

i,score1,score2
0,78,83
1,64,76
2,100,92
3,28,38
4,30,45
>>> evaluator.load_avg_score()
{'score1': 60.0, 'score2': 66.8}
>>> evaluator.load_avg_score(subject_subset=["score2"])
{'score2': 66.8}
>>> evaluator.load_avg_score(items=list(range(7)))
{'score1': 60.0, 'score2': 66.8}
- load_scores(subject_subset: list | None = None, items: list | None = None) → DataFrame
Load the scores from the save location.
Parameters
- subject_subset : list or None
The subjects of the scores to select, i.e., the columns. If None, select all subjects. In the returned pd.DataFrame, the columns will be in the same order as subject_subset.
- items : list or None
The indices of the items to select. If None, select all items. This will be applied after subject_subset. This does not necessarily need to be a subset of the index of the loaded pd.DataFrame. The indices that do not exist in the index of the loaded pd.DataFrame will be assigned NaN. In the returned pd.DataFrame, the rows will be in the same order as items.
Returns
- scores : pandas.DataFrame
The scores loaded from save_path.
Examples
Suppose that the file at save_path looks like the following:

i,score1,score2
0,78,83
1,64,76
3,100,92
4,28,38
5,30,45
>>> evaluator.load_scores()
   score1  score2
i
0      78      83
1      64      76
3     100      92
4      28      38
5      30      45
>>> evaluator.load_scores(subject_subset=["score2"])
   score2
i
0      83
1      76
3      92
4      38
5      45
>>> evaluator.load_scores(items=list(range(7)))
   score1  score2
i
0    78.0    83.0
1    64.0    76.0
2     NaN     NaN
3   100.0    92.0
4    28.0    38.0
5    30.0    45.0
6     NaN     NaN
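As a small follow-up sketch (an illustration, not part of the documented API), the loaded scores can be used to find the items that have no recorded score, e.g. items 2 and 6 in the example file above:

scores = evaluator.load_scores(items=list(range(7)))
missing = scores.index[scores.isna().any(axis=1)].tolist()
print(missing)  # [2, 6] for the example above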