ml3m.qa: Question Answering

class ml3m.qa.QaMetricEvaluator(dataset: str | Path, save_path: str | Path, info_func: Callable[[DataItemType], tuple[str, str, str]], *, fmt: DatasetFormat = 'jsonl', bleu_k: list[int] | None = None, logging_mode: LoggingMode = 'all', verbose: int = 0)[source]

Bases: BaseEvaluator

Evaluator for question-answering via common metrics.

This evaluator supports using the following metric to compare the actual response with the reference answer:

  • BLEU-k (BiLingual Evaluation Understudy)
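
As a rough illustration of what the BLEU-k metric compares, the sketch below computes a cumulative BLEU-k score with NLTK. This is not necessarily how ml3m implements the metric internally (tokenization and smoothing choices may differ); the helper name bleu_k and the whitespace tokenization are assumptions made only for illustration.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction


def bleu_k(actual: str, expected: str, k: int) -> float:
    """Cumulative BLEU-k of ``actual`` against the single reference ``expected``."""
    hypothesis = actual.split()      # naive whitespace tokenization (assumption)
    references = [expected.split()]  # sentence_bleu expects a list of references
    weights = tuple(1.0 / k for _ in range(k))  # equal weights over 1- to k-grams
    return sentence_bleu(
        references,
        hypothesis,
        weights=weights,
        smoothing_function=SmoothingFunction().method1,
    )


print(bleu_k("the cat sat on the mat", "the cat is on the mat", k=2))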

Parameters

dataset : str or pathlib.Path

The absolute path to the evaluation dataset.

save_path : str or pathlib.Path

The absolute path to the save location. This path may or may not exist, and if it exists, its file contents will be treated as a (partially) written result. Whether to overwrite the existing results or to build on them depends on overwrite when using the QaMetricEvaluator.evaluate() method.

info_func : Callable

The function that extracts the question, actual answer, and expected answer of a data item. The input parameter should be a pandas.Series, a list, or a dictionary, depending on fmt and the specific type of each data item. The output should be a tuple of three strings: the question, the actual answer to that question, and the expected answer to that question. See the notes for examples.

fmt : {“jsonl”, “json”, “csv”}, default=”jsonl”

The format of dataset.

bleu_k : list of int or None

The list of k-values used for BLEU-k. All values must be positive. If None, use k-values 1 to 4.

logging_mode : {“all”, “failed”, “none”}, default=”all”

The logging mode: whether to save the logs of all items, only the failed items, or no logs at all.

verbose : int, default=0

The verbosity level of the processing. For negative levels, only a progress bar will be displayed. For level 0, the errored items will also be displayed. For positive levels, all items will be displayed, and the verbosity level determines the number of lines to display for the message of each item.

Notes

Here are some examples of info_func:

Assume that dataset is in .jsonl format and each line is of the following form: {"instruction": "xxx", "input": "xxx", "output": "xxx", "history": [], "response": "xxx"}. Then info_func can be defined as follows:

def info_func(data_item: dict) -> tuple[str, str, str]:
    question = data_item["instruction"] + "\n" + data_item["input"]
    actual = data_item["response"]
    expected = data_item["output"]
    return question, actual, expected

Now assume that dataset is in .csv format with columns “question”, “answer”, and “response”. Then info_func can be defined as follows:

def info_func(data_item: pandas.Series) -> tuple[str, str, str]:
    question, answer, response = data_item[["question", "answer", "response"]]
    return question, response, answer
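
Putting the pieces together, a construction of the evaluator for the .csv dataset above might look like the following sketch. The paths are hypothetical placeholders, and passing bleu_k=[1, 2] is only an example choice.

from ml3m.qa import QaMetricEvaluator


def info_func(data_item) -> tuple[str, str, str]:
    # Columns of the .csv dataset: "question", "answer" (reference), "response" (actual).
    question, answer, response = data_item[["question", "answer", "response"]]
    return question, response, answer


evaluator = QaMetricEvaluator(
    dataset="/abs/path/to/dataset.csv",        # hypothetical path
    save_path="/abs/path/to/bleu_scores.csv",  # hypothetical path
    info_func=info_func,
    fmt="csv",
    bleu_k=[1, 2],  # compute BLEU-1 and BLEU-2 only
)
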
evaluate(*, overwrite: bool = False) → bool

Evaluate the specified dataset.

Parameters

overwritebool, default=False

Whether to overwrite the data in save_path. If False, the evaluation will be built upon existing data in save_path; otherwise all data will be evaluated and existing data will be overwritten.

Returns

completedbool

Whether the task has been completed.
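
For example, a minimal usage sketch (continuing from the evaluator constructed in the notes above, and assuming that a second call with overwrite=False simply builds on whatever results are already saved):

# Evaluate all items; existing results at save_path are kept and built upon.
completed = evaluator.evaluate()
if not completed:
    # Some items failed (e.g., malformed data); rerunning with overwrite=False
    # builds on the saved results instead of starting from scratch.
    completed = evaluator.evaluate()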

load_avg_score(subject_subset: list | None = None, items: list | None = None) → dict[str, numbers.Real]

Load the average score of each subject from the save location.

Parameters

subject_subset : list or None

The subjects of the scores to select, i.e., the columns. If None, select all subjects.

items : list or None

The indices of the items to select. If None, select all items. This will be applied after subject_subset. This does not necessarily need to be a subset of the index of the loaded pd.DataFrame. However, any item out of range would not be taken into account when computing the average score.

Returns

avg_score : dict

The average score of each subject loaded from save_path, as a dictionary mapping each subject to its average score.

Examples

Suppose that the file at save_path looks like the following:

i,score1,score2
0,78,83
1,64,76
2,100,92
3,28,38
4,30,45
>>> evaluator.load_avg_score()  
{'score1': 60.0, 'score2': 66.8}
>>> evaluator.load_avg_score(subject_subset=["score2"])  
{'score2': 66.8}
>>> evaluator.load_avg_score(items=list(range(7)))  
{'score1': 60.0, 'score2': 66.8}
load_scores(subject_subset: list | None = None, items: list | None = None) → DataFrame

Load the scores from the save location.

Parameters

subject_subset : list or None

The subjects of the scores to select, i.e., the columns. If None, select all subjects. In the returned pd.DataFrame, the columns will be in the same order as subject_subset.

items : list or None

The indices of the items to select. If None, select all items. This will be applied after subject_subset. This does not necessarily need to be a subset of the index of the loaded pd.DataFrame. The indices that do not exist in the index of the loaded pd.DataFrame will be assigned NaN. In the returned pd.DataFrame, the rows will be in the same order as items.

Returns

scores : pandas.DataFrame

The scores loaded from save_path.

Examples

Suppose that the file at save_path looks like the following:

i,score1,score2
0,78,83
1,64,76
3,100,92
4,28,38
5,30,45
>>> evaluator.load_scores()  
   score1  score2
i
0      78      83
1      64      76
3     100      92
4      28      38
5      30      45
>>> evaluator.load_scores(subject_subset=["score2"])  
   score2
i
0      83
1      76
3      92
4      38
5      45
>>> evaluator.load_scores(items=list(range(7)))  
   score1  score2
i
0    78.0    83.0
1    64.0    76.0
2     NaN     NaN
3   100.0    92.0
4    28.0    38.0
5    30.0    45.0
6     NaN     NaN
class ml3m.qa.QaOpenAIEvaluator(dataset: str | Path, save_path: str | Path, openai_config: str | Path, info_func: Callable[[DataItemType], tuple[str, str, str]], *, fmt: DatasetFormat = 'jsonl', domain: str | None = None, aspects: list[str] | None = None, aspect_descriptions: dict[str, str] | None = None, n_iter: int = 3, timeout: float = 60, model: str = 'gpt-3.5-turbo', logging_mode: LoggingMode = 'all', verbose: int = 0)[source]

Bases: BaseOpenAIEvaluator

Evaluator for question-answering via OpenAI.

This evaluator utilizes the ability of OpenAI models to judge the quality of a response along the following aspects:

  • Accuracy: Using the reference answer as the ground truth, does the response include factually incorrect information?

  • Completeness: Compared with the reference answer, is the response missing details?

  • Clarity: Is the response well-organized and clearly presented? If accuracy and completeness are poor, clarity should also be considered poor.

Parameters

dataset : str or pathlib.Path

The absolute path to the evaluation dataset.

save_path : str or pathlib.Path

The absolute path to the save location. This path may or may not exist, and if it exists, its file contents will be treated as a (partially) written result. Whether to overwrite the existing results or to build on them depends on overwrite when using the QaOpenAIEvaluator.evaluate() method.

openai_config : str or pathlib.Path

The absolute path to the OpenAI configuration file.

info_func : Callable

The function that extracts the question, actual answer, and expected answer of a data item. The input parameter should be a pandas.Series, a list, or a dictionary, depending on fmt and the specific type of each data item. The output should be a tuple of three strings: the question, the actual answer to that question, and the expected answer to that question. See the notes for examples.

fmt : {“jsonl”, “json”, “csv”}, default=”jsonl”

The format of dataset.

domain : str, optional

The domain of knowledge. ChatGPT will be told in the prompt that your question, answer, and reference answer are “in {domain}”. If None, this information will not be given to ChatGPT.

aspects : list of str, optional

The aspects to evaluate. If None, evaluate accuracy, completeness, and clarity. Any aspect other than “accuracy”, “completeness”, and “clarity” must be specified in aspect_descriptions.

aspect_descriptions : dict, optional

An optional dictionary mapping aspects to their descriptions. “accuracy”, “completeness”, and “clarity” have default descriptions but can also be overridden by this parameter. Any other aspect, if used in aspects, must exist as a key here.

n_iter : int, default=3

The number of iterations for each data item. The mean of the scores for each data item will be taken as the final score.

timeout : float, default=60

The timeout in seconds. This is not the OpenAI timeout, but the timeout for cancelling the worker tasks.

model : str, default=”gpt-3.5-turbo”

The ID of the model to use; it must be one of the available OpenAI models that support the ChatCompletion API. See also https://platform.openai.com/docs/models/model-endpoint-compatibility

logging_mode : {“all”, “failed”, “none”}, default=”all”

The logging mode: whether to save the logs of all items, only the failed items, or no logs at all.

verbose : int, default=0

The verbosity level of the processing. For negative levels, only a progress bar will be displayed. For level 0, the errored items will also be displayed. For positive levels, all items will be displayed, and the verbosity level determines the number of lines to display for the message of each item.

Notes

Here are some examples of info_func:

Assume that dataset is in .jsonl format and each line is of the following form: {"instruction": "xxx", "input": "xxx", "output": "xxx", "history": [], "response": "xxx"}. Then info_func can be defined as follows:

def info_func(data_item: dict) -> tuple[str, str, str]:
    question = data_item["instruction"] + "\n" + data_item["input"]
    actual = data_item["response"]
    expected = data_item["output"]
    return question, actual, expected

Now assume that dataset is in .csv format with columns “question”, “answer”, and “response”. Then info_func can be defined as follows:

def info_func(data_item: pandas.Series) -> tuple[str, str, str]:
    question, answer, response = data_item[["question", "answer", "response"]]
    return question, response, answer
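
As a final, hedged sketch, the snippet below wires the documented parameters together for the .jsonl dataset above, including one custom aspect. The file paths, the OpenAI configuration file, and the custom “conciseness” aspect with its description are hypothetical placeholders.

from ml3m.qa import QaOpenAIEvaluator


def info_func(data_item: dict) -> tuple[str, str, str]:
    question = data_item["instruction"] + "\n" + data_item["input"]
    return question, data_item["response"], data_item["output"]


evaluator = QaOpenAIEvaluator(
    dataset="/abs/path/to/dataset.jsonl",             # hypothetical path
    save_path="/abs/path/to/openai_scores.csv",       # hypothetical path
    openai_config="/abs/path/to/openai_config.json",  # hypothetical config file
    info_func=info_func,
    fmt="jsonl",
    domain="medicine",  # optional domain hint passed to ChatGPT
    # Two built-in aspects plus a custom one; any custom aspect must be
    # described in aspect_descriptions.
    aspects=["accuracy", "completeness", "conciseness"],
    aspect_descriptions={
        "conciseness": "Does the response avoid unnecessary or redundant content?"
    },
    n_iter=3,
    model="gpt-3.5-turbo",
)
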
evaluate(*, overwrite: bool = False) → bool

Evaluate the specified dataset.

Parameters

overwritebool, default=False

Whether to overwrite the data in save_path. If False, the evaluation will be built upon existing data in save_path; otherwise all data will be evaluated and existing data will be overwritten.

Returns

completedbool

Whether the task has been completed.

load_avg_score(subject_subset: list | None = None, items: list | None = None) → dict[str, numbers.Real]

Load the average score of each subject from the save location.

Parameters

subject_subset : list or None

The subjects of the scores to select, i.e., the columns. If None, select all subjects.

items : list or None

The indices of the items to select. If None, select all items. This will be applied after subject_subset. This does not necessarily need to be a subset of the index of the loaded pd.DataFrame. However, any item out of range would not be taken into account when computing the average score.

Returns

avg_score : dict

The average score of each subject loaded from save_path, as a dictionary mapping each subject to its average score.

Examples

Suppose that the file at save_path looks like the following:

i,score1,score2
0,78,83
1,64,76
2,100,92
3,28,38
4,30,45
>>> evaluator.load_avg_score()  
{'score1': 60.0, 'score2': 66.8}
>>> evaluator.load_avg_score(subject_subset=["score2"])  
{'score2': 66.8}
>>> evaluator.load_avg_score(items=list(range(7)))  
{'score1': 60.0, 'score2': 66.8}
load_scores(subject_subset: list | None = None, items: list | None = None) → DataFrame

Load the scores from the save location.

Parameters

subject_subset : list or None

The subjects of the scores to select, i.e., the columns. If None, select all subjects. In the returned pd.DataFrame, the columns will be in the same order as subject_subset.

items : list or None

The indices of the items to select. If None, select all items. This will be applied after subject_subset. This does not necessarily need to be a subset of the index of the loaded pd.DataFrame. The indices that do not exist in the index of the loaded pd.DataFrame will be assigned NaN. In the returned pd.DataFrame, the rows will be in the same order as items.

Returns

scores : pandas.DataFrame

The scores loaded from save_path.

Examples

Suppose that the file at save_path looks like the following:

i,score1,score2
0,78,83
1,64,76
3,100,92
4,28,38
5,30,45
>>> evaluator.load_scores()  
   score1  score2
i
0      78      83
1      64      76
3     100      92
4      28      38
5      30      45
>>> evaluator.load_scores(subject_subset=["score2"])  
   score2
i
0      83
1      76
3      92
4      38
5      45
>>> evaluator.load_scores(items=list(range(7)))  
   score1  score2
i
0    78.0    83.0
1    64.0    76.0
2     NaN     NaN
3   100.0    92.0
4    28.0    38.0
5    30.0    45.0
6     NaN     NaN