ml3m.qa: Question Answering

class ml3m.qa.QaMetricEvaluator(dataset: str | Path, save_path: str | Path, info_func: Callable[[DataItemType], tuple[str, str, str]], *, fmt: DatasetFormat = 'jsonl', bleu_k: list[int] | None = None, logging_mode: LoggingMode = 'all', verbose: int = 0)[source]

Bases: BaseEvaluator

Evaluator for question-answering via common metrics.

This evaluator supports using the following metric to compare the actual response with the reference answer:

  • BLEU-k (BiLingual Evaluation Understudy)
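
As a rough illustration of what the BLEU-k metric compares, the sketch below computes a cumulative BLEU-k score with NLTK. This is not necessarily how ml3m implements the metric internally (tokenization and smoothing choices may differ); the helper name bleu_k and the whitespace tokenization are assumptions made only for illustration.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction


def bleu_k(actual: str, expected: str, k: int) -> float:
    """Cumulative BLEU-k of ``actual`` against the single reference ``expected``."""
    hypothesis = actual.split()      # naive whitespace tokenization (assumption)
    references = [expected.split()]  # sentence_bleu expects a list of references
    weights = tuple(1.0 / k for _ in range(k))  # equal weights over 1- to k-grams
    return sentence_bleu(
        references,
        hypothesis,
        weights=weights,
        smoothing_function=SmoothingFunction().method1,
    )


print(bleu_k("the cat sat on the mat", "the cat is on the mat", k=2))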

Parameters

dataset : str or pathlib.Path

The absolute path to the evaluation dataset.

save_path : str or pathlib.Path

The absolute path to the save location. This path may or may not exist, and if it exists, its file contents will be treated as a (partially) written result. Whether to overwrite the existing results or to build on them depends on overwrite when using the QaMetricEvaluator.evaluate() method.

info_func : Callable

The function that extracts the question, actual answer, and expected answer of a data item. The input parameter should be a pandas.Series, a list, or a dictionary, depending on fmt and the specific type of each data item. The output should be a tuple of three strings: the question, the actual answer to that question, and the expected answer to that question. See the notes for examples.

fmt : {“jsonl”, “json”, “csv”}, default=”jsonl”

The format of dataset.

bleu_k : list of int or None

The list of k-values used for BLEU-k. All values must be positive. If None, use k-values 1 to 4.

logging_mode : {“all”, “failed”, “none”}, default=”all”

The logging mode: whether to save the logs of all items, only the failed items, or no logs at all.

verbose : int, default=0

The verbosity level of the processing. For negative levels, only a progress bar will be displayed. For level 0, the errored items will also be displayed. For positive levels, all items will be displayed, and the verbosity level determines the number of lines to display for the message of each item.

Notes

Here are some examples of info_func:

Assume that dataset is in .jsonl format and each line is of the following form: {"instruction": "xxx", "input": "xxx", "output": "xxx", "history": [], "response": "xxx"}. Then info_func can be defined as follows:

def info_func(data_item: dict) -> tuple[str, str, str]:
    question = data_item["instruction"] + "\n" + data_item["input"]
    actual = data_item["response"]
    expected = data_item["output"]
    return question, actual, expected

Now assume that dataset is in .csv format with columns “question”, “answer”, and “response”. Then info_func can be defined as follows:

def info_func(data_item: pandas.Series) -> tuple[str, str, str]:
    question, answer, response = data_item[["question", "answer", "response"]]
    return question, response, answer
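
Putting the pieces together, a construction of the evaluator for the .csv dataset above might look like the following sketch. The paths are hypothetical placeholders, and passing bleu_k=[1, 2] is only an example choice.

from ml3m.qa import QaMetricEvaluator


def info_func(data_item) -> tuple[str, str, str]:
    # Columns of the .csv dataset: "question", "answer" (reference), "response" (actual).
    question, answer, response = data_item[["question", "answer", "response"]]
    return question, response, answer


evaluator = QaMetricEvaluator(
    dataset="/abs/path/to/dataset.csv",        # hypothetical path
    save_path="/abs/path/to/bleu_scores.csv",  # hypothetical path
    info_func=info_func,
    fmt="csv",
    bleu_k=[1, 2],  # compute BLEU-1 and BLEU-2 only
)
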
evaluate(*, overwrite: bool = False) → bool

Evaluate the specified dataset.

Parameters

overwritebool, default=False

Whether to overwrite the data in save_path. If False, the evaluation will be built upon existing data in save_path; otherwise all data will be evaluated and existing data will be overwritten.

Returns

completedbool

Whether the task has been completed.
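
For example, a minimal usage sketch (continuing from the evaluator constructed in the notes above, and assuming that a second call with overwrite=False simply builds on whatever results are already saved):

# Evaluate all items; existing results at save_path are kept and built upon.
completed = evaluator.evaluate()
if not completed:
    # Some items failed (e.g., malformed data); rerunning with overwrite=False
    # builds on the saved results instead of starting from scratch.
    completed = evaluator.evaluate()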

load_avg_score(subject_subset: list | None = None, items: list | None = None) → dict[str, numbers.Real]

Load the average score of each subject from the save location.

Parameters

subject_subset : list or None

The subjects of the scores to select, i.e., the columns. If None, select all subjects.

items : list or None

The indices of the items to select. If None, select all items. This will be applied after subject_subset. This does not necessarily need to be a subset of the index of the loaded pd.DataFrame. However, any item out of range would not be taken into account when computing the average score.

Returns

avg_score : dict

The average score of each subject loaded from save_path, as a dictionary mapping each subject to its average score.

Examples

Suppose that the file at save_path looks like the following:

i,score1,score2
0,78,83
1,64,76
2,100,92
3,28,38
4,30,45
>>> evaluator.load_avg_score()  
{'score1': 60.0, 'score2': 66.8}
>>> evaluator.load_avg_score(subject_subset=["score2"])  
{'score2': 66.8}
>>> evaluator.load_avg_score(items=list(range(7)))  
{'score1': 60.0, 'score2': 66.8}
load_scores(subject_subset: list | None = None, items: list | None = None) → DataFrame

Load the scores from the save location.

Parameters

subject_subset : list or None

The subjects of the scores to select, i.e., the columns. If None, select all subjects. In the returned pd.DataFrame, the columns will be in the same order as subject_subset.

items : list or None

The indices of the items to select. If None, select all items. This will be applied after subject_subset. This does not necessarily need to be a subset of the index of the loaded pd.DataFrame. The indices that do not exist in the index of the loaded pd.DataFrame will be assigned NaN. In the returned pd.DataFrame, the rows will be in the same order as items.

Returns

scores : pandas.DataFrame

The scores loaded from save_path.

Examples

Suppose that the file at save_path looks like the following:

i,score1,score2
0,78,83
1,64,76
3,100,92
4,28,38
5,30,45
>>> evaluator.load_scores()  
   score1  score2
i
0      78      83
1      64      76
3     100      92
4      28      38
5      30      45
>>> evaluator.load_scores(subject_subset=["score2"])  
   score2
i
0      83
1      76
3      92
4      38
5      45
>>> evaluator.load_scores(items=list(range(7)))  
   score1  score2
i
0    78.0    83.0
1    64.0    76.0
2     NaN     NaN
3   100.0    92.0
4    28.0    38.0
5    30.0    45.0
6     NaN     NaN
class ml3m.qa.QaOpenAIEvaluator(dataset: str | Path, save_path: str | Path, openai_config: str | Path, info_func: Callable[[DataItemType], tuple[str, str, str]], *, fmt: DatasetFormat = 'jsonl', domain: str | None = None, aspects: list[str] | None = None, aspect_descriptions: dict[str, str] | None = None, n_iter: int = 3, timeout: float = 60, model: str = 'gpt-3.5-turbo', logging_mode: LoggingMode = 'all', verbose: int = 0)[source]

Bases: BaseOpenAIEvaluator

Evaluator for question-answering via OpenAI.

This evaluator utilizes the ability of OpenAI models to judge the quality of a response along the following aspects:

  • Accuracy: Using the reference answer as the ground truth, does the response include factually incorrect information?

  • Completeness: Compared with the reference answer, is the response missing details?

  • Clarity: Is the response well-organized and clearly presented? If accuracy and completeness are poor, clarity should also be considered poor.

Parameters

dataset : str or pathlib.Path

The absolute path to the evaluation dataset.

save_path : str or pathlib.Path

The absolute path to the save location. This path may or may not exist, and if it exists, its file contents will be treated as a (partially) written result. Whether to overwrite the existing results or to build on them depends on overwrite when using the QaOpenAIEvaluator.evaluate() method.

openai_config : str or pathlib.Path

The absolute path to the OpenAI configuration file.

info_func : Callable

The function that extracts the question, actual answer, and expected answer of a data item. The input parameter should be a pandas.Series, a list, or a dictionary, depending on fmt and the specific type of each data item. The output should be a tuple of three strings: the question, the actual answer to that question, and the expected answer to that question. See the notes for examples.

fmt : {“jsonl”, “json”, “csv”}, default=”jsonl”

The format of dataset.

domain : str, optional

The domain of knowledge. ChatGPT will be told in the prompt that your question, answer, and reference answer are “in {domain}”. If None, this information will not be given to ChatGPT.

aspects : list of str, optional

The aspects to evaluate. If None, evaluate accuracy, completeness, and clarity. Any aspect other than “accuracy”, “completeness”, and “clarity” must be specified in aspect_descriptions.

aspect_descriptions : dict, optional

An optional dictionary mapping aspects to their descriptions. “accuracy”, “completeness”, and “clarity” have default descriptions but can also be overridden by this parameter. Any other aspect, if used in aspects, must exist as a key here.

n_iter : int, default=3

The number of iterations for each data item. The mean of the scores for each data item will be taken as the final score.

timeout : float, default=60

The timeout in seconds. This is not the OpenAI timeout, but the timeout for cancelling the worker tasks.

model : str, default=”gpt-3.5-turbo”

The ID of the model to use; it must be one of the available OpenAI models that support the ChatCompletion API. See also https://platform.openai.com/docs/models/model-endpoint-compatibility

logging_mode : {“all”, “failed”, “none”}, default=”all”

The logging mode: whether to save the logs of all items, only the failed items, or no logs at all.

verbose : int, default=0

The verbosity level of the processing. For negative levels, only a progress bar will be displayed. For level 0, the errored items will also be displayed. For positive levels, all items will be displayed, and the verbosity level determines the number of lines to display for the message of each item.

Notes

Here are some examples of info_func:

Assume that dataset is in .jsonl format and each line is of the following form: {"instruction": "xxx", "input": "xxx", "output": "xxx", "history": [], "response": "xxx"}. Then info_func can be defined as follows:

def info_func(data_item: dict) -> tuple[str, str, str]:
    question = data_item["instruction"] + "\n" + data_item["input"]
    actual = data_item["response"]
    expected = data_item["output"]
    return question, actual, expected

Now assume that dataset is in .csv format with columns “question”, “answer”, and “response”. Then info_func can be defined as follows:

def info_func(data_item: pandas.Series) -> tuple[str, str, str]:
    question, answer, response = data_item[["question", "answer", "response"]]
    return question, response, answer
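
As a final, hedged sketch, the snippet below wires the documented parameters together for the .jsonl dataset above, including one custom aspect. The file paths, the OpenAI configuration file, and the custom “conciseness” aspect with its description are hypothetical placeholders.

from ml3m.qa import QaOpenAIEvaluator


def info_func(data_item: dict) -> tuple[str, str, str]:
    question = data_item["instruction"] + "\n" + data_item["input"]
    return question, data_item["response"], data_item["output"]


evaluator = QaOpenAIEvaluator(
    dataset="/abs/path/to/dataset.jsonl",             # hypothetical path
    save_path="/abs/path/to/openai_scores.csv",       # hypothetical path
    openai_config="/abs/path/to/openai_config.json",  # hypothetical config file
    info_func=info_func,
    fmt="jsonl",
    domain="medicine",  # optional domain hint passed to ChatGPT
    # Two built-in aspects plus a custom one; any custom aspect must be
    # described in aspect_descriptions.
    aspects=["accuracy", "completeness", "conciseness"],
    aspect_descriptions={
        "conciseness": "Does the response avoid unnecessary or redundant content?"
    },
    n_iter=3,
    model="gpt-3.5-turbo",
)
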
evaluate(*, overwrite: bool = False) → bool

Evaluate the specified dataset.

Parameters

overwritebool, default=False

Whether to overwrite the data in save_path. If False, the evaluation will be built upon existing data in save_path; otherwise all data will be evaluated and existing data will be overwritten.

Returns

completedbool

Whether the task has been completed.

load_avg_score(subject_subset: list | None = None, items: list | None = None) → dict[str, numbers.Real]

Load the average score of each subject from the save location.

Parameters

subject_subset : list or None

The subjects of the scores to select, i.e., the columns. If None, select all subjects.

items : list or None

The indices of the items to select. If None, select all items. This will be applied after subject_subset. This does not necessarily need to be a subset of the index of the loaded pd.DataFrame. However, any item out of range would not be taken into account when computing the average score.

Returns

avg_score : dict

The average score of each subject loaded from save_path, as a dictionary mapping each subject to its average score.

Examples

Suppose that the file at save_path looks like the following:

i,score1,score2
0,78,83
1,64,76
2,100,92
3,28,38
4,30,45
>>> evaluator.load_avg_score()  
{'score1': 60.0, 'score2': 66.8}
>>> evaluator.load_avg_score(subject_subset=["score2"])  
{'score2': 66.8}
>>> evaluator.load_avg_score(items=list(range(7)))  
{'score1': 60.0, 'score2': 66.8}
load_scores(subject_subset: list | None = None, items: list | None = None) → DataFrame

Load the scores from the save location.

Parameters

subject_subset : list or None

The subjects of the scores to select, i.e., the columns. If None, select all subjects. In the returned pd.DataFrame, the columns will be in the same order as subject_subset.

items : list or None

The indices of the items to select. If None, select all items. This will be applied after subject_subset. This does not necessarily need to be a subset of the index of the loaded pd.DataFrame. The indices that do not exist in the index of the loaded pd.DataFrame will be assigned NaN. In the returned pd.DataFrame, the rows will be in the same order as items.

Returns

scores : pandas.DataFrame

The scores loaded from save_path.

Examples

Suppose that the file at save_path looks like the following:

i,score1,score2
0,78,83
1,64,76
3,100,92
4,28,38
5,30,45
>>> evaluator.load_scores()  
   score1  score2
i
0      78      83
1      64      76
3     100      92
4      28      38
5      30      45
>>> evaluator.load_scores(subject_subset=["score2"])  
   score2
i
0      83
1      76
3      92
4      38
5      45
>>> evaluator.load_scores(items=list(range(7)))  
   score1  score2
i
0    78.0    83.0
1    64.0    76.0
2     NaN     NaN
3   100.0    92.0
4    28.0    38.0
5    30.0    45.0
6     NaN     NaN