ml3m.qa: Question Answering
- class ml3m.qa.QaMetricEvaluator(dataset: str | Path, save_path: str | Path, info_func: Callable[[DataItemType], tuple[str, str, str]], *, fmt: DatasetFormat = 'jsonl', bleu_k: list[int] | None = None, logging_mode: LoggingMode = 'all', verbose: int = 0)[source]
Bases:
BaseEvaluator
Evaluator for question-answering via common metrics.
This evaluator supports using the following metric to compare the actual response with the reference answer:
BLEU-k (BiLingual Evaluation Understudy)
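For orientation, BLEU-k here denotes the BLEU score computed with n-grams up to order k. Below is a minimal, illustrative sketch of how such a score could be computed with nltk; this is an assumption for illustration only and not necessarily how ml3m computes the metric internally (the nltk dependency and the smoothing choice are not part of this API).

# Illustrative only: a BLEU-k style score via nltk (assumed dependency).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def bleu_k(actual: str, expected: str, k: int = 4) -> float:
    reference = [expected.split()]               # list of reference token lists
    hypothesis = actual.split()                  # candidate token list
    weights = tuple(1.0 / k for _ in range(k))   # uniform weights over 1..k-grams
    return sentence_bleu(
        reference,
        hypothesis,
        weights=weights,
        smoothing_function=SmoothingFunction().method1,
    )

print(bleu_k("the cat sat on the mat", "the cat is on the mat", k=2))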
Parameters
- dataset : str or pathlib.Path
The absolute path to the evaluation dataset.
- save_path : str or pathlib.Path
The absolute path to the save location. This path may or may not exist, and if it exists, its file contents will be treated as a (partially) written result. Whether to overwrite the existing results or to build on them depends on overwrite when using the QaMetricEvaluator.evaluate() method.
- info_func : Callable
The function that extracts the question, actual answer, and expected answer of a data item. The input parameter should be a pandas.Series, a list, or a dictionary, depending on fmt and the specific type of each data item. The output should be a tuple of three strings: respectively, the question, the actual answer to that question, and the expected answer to that question. See the notes for examples.
- fmt : {"jsonl", "json", "csv"}, default="jsonl"
The format of dataset.
- bleu_k : list of int or None
The list of k-values used for BLEU-k. Must be positive. If None, use k-values 1 to 4.
- logging_mode : {"all", "failed", "none"}, default="all"
The logging mode: whether to save the logs of all items, only of failed items, or to save no logs.
- verbose : int, default=0
The verbosity level of the processing. For negative levels, only a progress bar is displayed. At level 0, errored items are also displayed. For positive levels, all items are displayed, and the verbosity level determines the number of lines to display for the message of each item.
Notes
Here are some examples of info_func:

Assume that dataset is in .jsonl format and each line is of the following form: {"instruction": "xxx", "input": "xxx", "output": "xxx", "history": [], "response": "xxx"}. Then info_func can be defined as follows:

def info_func(data_item: dict) -> tuple[str, str, str]:
    question = data_item["instruction"] + "\n" + data_item["input"]
    actual = data_item["response"]
    expected = data_item["output"]
    return question, actual, expected
Now assume that dataset is in .csv format with columns "question", "answer", and "response". Then info_func can be defined as follows:

def info_func(data_item: pandas.Series) -> tuple[str, str, str]:
    question, answer, response = data_item[["question", "answer", "response"]]
    return question, response, answer
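Putting the pieces together, a minimal usage sketch might look like the following. The file paths are placeholders, and only arguments shown in the signature above are used; adapt the paths and info_func to your own data.

# Hypothetical paths; adjust to your own dataset and save location.
from ml3m.qa import QaMetricEvaluator

def info_func(data_item: dict) -> tuple[str, str, str]:
    question = data_item["instruction"] + "\n" + data_item["input"]
    actual = data_item["response"]
    expected = data_item["output"]
    return question, actual, expected

evaluator = QaMetricEvaluator(
    dataset="/abs/path/to/dataset.jsonl",
    save_path="/abs/path/to/results.csv",
    info_func=info_func,
    fmt="jsonl",
    bleu_k=[1, 2, 3, 4],   # BLEU-1 through BLEU-4
)
completed = evaluator.evaluate()
if completed:
    print(evaluator.load_avg_score())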
- evaluate(*, overwrite: bool = False) bool
Evaluate the specified dataset.
Parameters
- overwrite : bool, default=False
Whether to overwrite the data in save_path. If False, the evaluation will be built upon the existing data in save_path; otherwise, all data will be re-evaluated and the existing data will be overwritten.
Returns
- completedbool
Whether the task has been completed.
- load_avg_score(subject_subset: list | None = None, items: list | None = None) dict[str, numbers.Real]
Load the average score of each subject from the save location.
Parameters
- subject_subset : list or None
The subjects of the scores to select, i.e., the columns. If None, select all subjects.
- items : list or None
The indices of the items to select. If None, select all items. This will be applied after subject_subset. It does not necessarily need to be a subset of the index of the loaded pd.DataFrame; however, any item out of range will not be taken into account when computing the average score.
Returns
- avg_score : dict[str, numbers.Real]
The average score of each subject, loaded from save_path.
Examples
Suppose that the file at save_path looks like the following:

i,score1,score2
0,78,83
1,64,76
2,100,92
3,28,38
4,30,45
>>> evaluator.load_avg_score()
{'score1': 60.0, 'score2': 66.8}
>>> evaluator.load_avg_score(subject_subset=["score2"])
{'score2': 66.8}
>>> evaluator.load_avg_score(items=list(range(7)))
{'score1': 60.0, 'score2': 66.8}
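For comparison, the same averages can be reproduced with plain pandas, assuming the save file is exactly the CSV shown above with "i" as its index column (the path below is a placeholder):

import pandas as pd

# Assuming save_path points to the CSV shown above.
df = pd.read_csv("/abs/path/to/save.csv", index_col="i")
print(df.mean().to_dict())              # {'score1': 60.0, 'score2': 66.8}
print(df[["score2"]].mean().to_dict())  # {'score2': 66.8}
# Out-of-range items (e.g., 5 and 6 in items=list(range(7))) are simply
# ignored, so the averages are unchanged.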
- load_scores(subject_subset: list | None = None, items: list | None = None) DataFrame
Load the scores from the save location.
Parameters
- subject_subset : list or None
The subjects of the scores to select, i.e., the columns. If None, select all subjects. In the returned pd.DataFrame, the columns will be in the same order as subject_subset.
- items : list or None
The indices of the items to select. If None, select all items. This will be applied after subject_subset. It does not necessarily need to be a subset of the index of the loaded pd.DataFrame; indices that do not exist in the index of the loaded pd.DataFrame will be assigned NaN. In the returned pd.DataFrame, the rows will be in the same order as items.
Returns
- scores : pandas.DataFrame
The scores loaded from save_path.
Examples
Suppose that the file at save_path looks like the following:

i,score1,score2
0,78,83
1,64,76
3,100,92
4,28,38
5,30,45
>>> evaluator.load_scores()
   score1  score2
i
0      78      83
1      64      76
3     100      92
4      28      38
5      30      45
>>> evaluator.load_scores(subject_subset=["score2"])
   score2
i
0      83
1      76
3      92
4      38
5      45
>>> evaluator.load_scores(items=list(range(7)))
   score1  score2
i
0    78.0    83.0
1    64.0    76.0
2     NaN     NaN
3   100.0    92.0
4    28.0    38.0
5    30.0    45.0
6     NaN     NaN
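The NaN-filling behaviour for missing items mirrors a plain pandas reindex. As a rough, assumed equivalent (again taking the save file to be the CSV shown above):

import pandas as pd

# Assuming save_path points to the CSV shown above (index 2 is absent).
df = pd.read_csv("/abs/path/to/save.csv", index_col="i")
print(df.reindex(range(7)))  # rows 2 and 6 become NaN, matching the output above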
- class ml3m.qa.QaOpenAIEvaluator(dataset: str | Path, save_path: str | Path, openai_config: str | Path, info_func: Callable[[DataItemType], tuple[str, str, str]], *, fmt: DatasetFormat = 'jsonl', domain: str | None = None, aspects: list[str] | None = None, aspect_descriptions: dict[str, str] | None = None, n_iter: int = 3, timeout: float = 60, model: str = 'gpt-3.5-turbo', logging_mode: LoggingMode = 'all', verbose: int = 0)[source]
Bases:
BaseOpenAIEvaluator
Evaluator for question-answering via OpenAI.
This evaluator utilizes the ability of OpenAI models to tell the quality of a response from the following aspects:
Accuracy: Using the reference answer as the ground truth, does the response include factually incorrect information?
Completeness: Compared with the reference answer, is the response missing details?
Clarity: Is the response well-organized and clearly presented? If accuracy and completeness are poor, clarity should also be considered poor.
Parameters
- dataset : str or pathlib.Path
The absolute path to the evaluation dataset.
- save_path : str or pathlib.Path
The absolute path to the save location. This path may or may not exist, and if it exists, its file contents will be treated as a (partially) written result. Whether to overwrite the existing results or to build on them depends on overwrite when using the QaOpenAIEvaluator.evaluate() method.
- openai_config : str or pathlib.Path
The absolute path to the OpenAI configuration file.
- info_func : Callable
The function that extracts the question, actual answer, and expected answer of a data item. The input parameter should be a pandas.Series, a list, or a dictionary, depending on fmt and the specific type of each data item. The output should be a tuple of three strings: respectively, the question, the actual answer to that question, and the expected answer to that question. See the notes for examples.
- fmt : {"jsonl", "json", "csv"}, default="jsonl"
The format of dataset.
- domain : str, optional
The domain of knowledge. ChatGPT will be prompted to know that your question, answer, and reference answer are "in {domain}". If None, this information will not be given to ChatGPT.
- aspects : list of str, optional
The aspects to evaluate. If None, evaluate accuracy, completeness, and clarity. Any aspect other than "accuracy", "completeness", and "clarity" must be specified in aspect_descriptions.
- aspect_descriptions : dict, optional
An optional dictionary mapping aspects to their descriptions. "accuracy", "completeness", and "clarity" have default descriptions but can also be overridden by this parameter. Any other aspect, if used in aspects, must exist as a key here; see the sketch after this parameter list.
- n_iter : int, default=3
The number of iterations for each data item. The mean of the scores for each data item will be taken as the final score.
- timeout : float, default=60
The timeout in seconds. This is not the OpenAI timeout, but the timeout for cancelling the worker tasks.
- model : str, default="gpt-3.5-turbo"
The ID of the model to use; it must be one of the available OpenAI models that support the ChatCompletion API. See also https://platform.openai.com/docs/models/model-endpoint-compatibility
- logging_mode : {"all", "failed", "none"}, default="all"
The logging mode: whether to save the logs of all items, only of failed items, or to save no logs.
- verbose : int, default=0
The verbosity level of the processing. For negative levels, only a progress bar is displayed. At level 0, errored items are also displayed. For positive levels, all items are displayed, and the verbosity level determines the number of lines to display for the message of each item.
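As referenced in the aspect_descriptions entry above, here is a hedged sketch of how a custom aspect might be supplied. The aspect name "conciseness" and its description are made up for illustration; only "accuracy", "completeness", and "clarity" have built-in descriptions.

# Hypothetical custom aspect: any aspect beyond the built-in three must be
# described in aspect_descriptions.
aspects = ["accuracy", "conciseness"]
aspect_descriptions = {
    "conciseness": "Is the response free of irrelevant or redundant content?",
}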
Notes
Here are some examples of info_func:

Assume that dataset is in .jsonl format and each line is of the following form: {"instruction": "xxx", "input": "xxx", "output": "xxx", "history": [], "response": "xxx"}. Then info_func can be defined as follows:

def info_func(data_item: dict) -> tuple[str, str, str]:
    question = data_item["instruction"] + "\n" + data_item["input"]
    actual = data_item["response"]
    expected = data_item["output"]
    return question, actual, expected
Now assume that dataset is in .csv format with columns "question", "answer", and "response". Then info_func can be defined as follows:

def info_func(data_item: pandas.Series) -> tuple[str, str, str]:
    question, answer, response = data_item[["question", "answer", "response"]]
    return question, response, answer
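A minimal end-to-end sketch follows, assuming a CSV dataset with the columns above and a valid OpenAI configuration file; all paths and the domain value are placeholders, and only arguments shown in the signature above are used.

# Hypothetical paths and domain; adjust to your own setup.
from ml3m.qa import QaOpenAIEvaluator

def info_func(data_item) -> tuple[str, str, str]:
    question, answer, response = data_item[["question", "answer", "response"]]
    return question, response, answer

evaluator = QaOpenAIEvaluator(
    dataset="/abs/path/to/dataset.csv",
    save_path="/abs/path/to/results.csv",
    openai_config="/abs/path/to/openai_config",  # placeholder; see the ml3m docs for the expected format
    info_func=info_func,
    fmt="csv",
    domain="world history",
    n_iter=3,
)
completed = evaluator.evaluate()
if completed:
    print(evaluator.load_avg_score())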
- evaluate(*, overwrite: bool = False) bool
Evaluate the specified dataset.
Parameters
- overwrite : bool, default=False
Whether to overwrite the data in save_path. If False, the evaluation will be built upon the existing data in save_path; otherwise, all data will be re-evaluated and the existing data will be overwritten.
Returns
- completedbool
Whether the task has been completed.
- load_avg_score(subject_subset: list | None = None, items: list | None = None) dict[str, numbers.Real]
Load the average score of each subject from the save location.
Parameters
- subject_subset : list or None
The subjects of the scores to select, i.e., the columns. If None, select all subjects.
- items : list or None
The indices of the items to select. If None, select all items. This will be applied after subject_subset. It does not necessarily need to be a subset of the index of the loaded pd.DataFrame; however, any item out of range will not be taken into account when computing the average score.
Returns
- avg_score : dict[str, numbers.Real]
The average score of each subject, loaded from save_path.
Examples
Suppose that the file at save_path looks like the following:

i,score1,score2
0,78,83
1,64,76
2,100,92
3,28,38
4,30,45
>>> evaluator.load_avg_score()
{'score1': 60.0, 'score2': 66.8}
>>> evaluator.load_avg_score(subject_subset=["score2"])
{'score2': 66.8}
>>> evaluator.load_avg_score(items=list(range(7)))
{'score1': 60.0, 'score2': 66.8}
- load_scores(subject_subset: list | None = None, items: list | None = None) DataFrame
Load the scores from the save location.
Parameters
- subject_subset : list or None
The subjects of the scores to select, i.e., the columns. If None, select all subjects. In the returned pd.DataFrame, the columns will be in the same order as subject_subset.
- items : list or None
The indices of the items to select. If None, select all items. This will be applied after subject_subset. It does not necessarily need to be a subset of the index of the loaded pd.DataFrame; indices that do not exist in the index of the loaded pd.DataFrame will be assigned NaN. In the returned pd.DataFrame, the rows will be in the same order as items.
Returns
- scores : pandas.DataFrame
The scores loaded from save_path.
Examples
Suppose that the file at save_path looks like the following:

i,score1,score2
0,78,83
1,64,76
3,100,92
4,28,38
5,30,45
>>> evaluator.load_scores()
   score1  score2
i
0      78      83
1      64      76
3     100      92
4      28      38
5      30      45
>>> evaluator.load_scores(subject_subset=["score2"])
   score2
i
0      83
1      76
3      92
4      38
5      45
>>> evaluator.load_scores(items=list(range(7)))
   score1  score2
i
0    78.0    83.0
1    64.0    76.0
2     NaN     NaN
3   100.0    92.0
4    28.0    38.0
5    30.0    45.0
6     NaN     NaN