ml3m.base: Base Classes

class ml3m.base.BaseEvaluator(dataset: str | Path, save_path: str | Path, subjects: list, *, fmt: DatasetFormat = 'jsonl', workers: int | list[dict] = 1, n_iter: int = 1, agg_method: AggregateMethod | None = None, logging_mode: LoggingMode = 'all', verbose: int = 0)[source]

Bases: object

Base evaluator class.

Note

This class is meant to be subclassed. The methods that must be overridden include:

BaseEvaluator._get_score()

BaseEvaluator._aget_score()

Parameters

datasetstr or pathlib.Path

The absolute path to the evaluation dataset.

save_pathstr or pathlib.Path

The absolute path to the save location. This path may or may not exist, and if it exists, its file contents will be treated as a (partially) written result. Whether to overwrite the existing results or to build on them depends on overwrite when using the BaseEvaluator.evaluate() method.

subjectslist

The subjects to evaluate. This should strictly correspond to how the scores are obtained in BaseEvaluator._aget_score(). If the score is obtained as a single real value, subjects must be a list of one element, and that element will be used as the name of that score. If the score(s) are obtained as a dictionary of subject-value pairs, all items in subjects must appear among the keys of the obtained dictionary. Any additional key will be discarded; any missing key will be treated as an error.

fmt{“jsonl”, “json”, “csv”}, default=”jsonl”

The format of dataset.

workersint or list of dict, default=1

If workers is an integer, it will be treated as the number of workers. If only one worker is specified, the dataset will be processed sequentially; otherwise it will be asynchronously parallelized. If workers is a list of dictionaries, the length of this list will be treated as the number of workers, and each dictionary will be passed as additional keyword arguments to BaseEvaluator._aget_score(). Note that if workers is an integer, no additional keyword arguments will be passed. See the sketch after this parameter list.

n_iterint, default=1

The number of iterations for each data item. This is commonly used when one round of scoring is not convincing enough and the final scoring should be some statistics of multiple rounds of scoring.

agg_method{“mean”, “sum”, “min”, “max”, “mode”}, default=None

The aggregate method to use on multiple rounds of scoring. Ignored when n_iter=1, and otherwise this must not be None.

logging_mode{“all”, “failed”, “none”}, default=”all”

The logging mode, whether to save the logs of all items, or only of failed items, or save no log.

verboseint, default=0

The verbosity level of the processing. For negative levels, only a progress bar will be displayed. For level 0, the errored items will also be displayed. For positive levels, all items will be displayed, and the verbosity level determines the number of lines to display for the message of each item.
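For instance, the following is a minimal sketch of passing per-worker keyword arguments through workers. The subclass, the data item field name, and the api_key keyword are hypothetical and only for illustration; it assumes fmt="jsonl" so that each data item is a dictionary.

from ml3m.base import BaseEvaluator


class MyEvaluator(BaseEvaluator):
    """Hypothetical subclass, for illustration only."""

    def _get_score(self, data_item, *, api_key=None, **kwargs):
        # `api_key` is a per-worker keyword argument taken from the
        # corresponding dictionary in `workers`.
        response = data_item["response"]  # hypothetical field name
        return {"accuracy": float(bool(response.strip())), "fluency": 1.0}

    async def _aget_score(self, data_item, **kwargs):
        return self._get_score(data_item, **kwargs)


# Two workers, each receiving its own keyword arguments.
evaluator = MyEvaluator(
    dataset="/abs/path/to/dataset.jsonl",
    save_path="/abs/path/to/scores.csv",
    subjects=["accuracy", "fluency"],
    workers=[{"api_key": "KEY_1"}, {"api_key": "KEY_2"}],
)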

async _aget_score(data_item: DataItemType, **kwargs) Real | dict[Any, Real][source]

Evaluate a data item and obtain its score(s).

This should be the asynchronous version of BaseEvaluator._get_score(). See BaseEvaluator._get_score() for details. If obtaining the scores does not involve anything asynchronous, this can simply be overridden as follows:

async def _aget_score(self, data_item, **kwargs):
    return self._get_score(data_item, **kwargs)

However, note that this will cause only one worker to actually do all the work, because it never yields control of the event loop.

Note

This method is not implemented and must be overridden in subclasses. Moreover, this method must be defined as asynchronous.
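A genuinely asynchronous override should instead await the slow operation so that control of the event loop is handed back to other workers while waiting. A minimal sketch, where asyncio.sleep() stands in for any real asynchronous call (e.g., querying a model API) and the returned scores are placeholders:

import asyncio

async def _aget_score(self, data_item, **kwargs):
    # Stand-in for a real asynchronous call; awaiting here yields control
    # of the event loop so that other workers can make progress.
    await asyncio.sleep(0.1)
    return {"accuracy": 1.0, "fluency": 1.0}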

_get_score(data_item: DataItemType, **kwargs) Real | dict[Any, Real][source]

Evaluate a data item and obtain its score(s).

Note

This method is not implemented and must be overridden in subclasses.

Parameters

data_itemDataItemType

The data item.

kwargs

The additional keyword arguments.

Returns

scoresreal or dict

The evaluated scores, either a single score or a dictionary of subject-score pairs.

Notes

Note that if there are multiple scores obtained, i.e., returning a dictionary, remember that the “i” key cannot be included since it is reserved for indexing.

Moreover, it is recommended not to catch the exceptions that cause the processing of a data item to fail, since otherwise BaseEvaluator.evaluate() will not realize that the data item errors out.
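For instance, here is a minimal sketch of an override that returns a dictionary of subject-score pairs. The field names are hypothetical, and a missing field raises KeyError, which is deliberately left uncaught so that BaseEvaluator.evaluate() can mark the item as failed:

def _get_score(self, data_item, **kwargs):
    # "response" and "reference" are hypothetical field names; a missing
    # field raises KeyError, which is intentionally not caught here.
    response, reference = data_item["response"], data_item["reference"]
    return {
        "accuracy": float(response.strip() == reference.strip()),
        "non_empty": float(bool(response.strip())),
        # Do not use "i" as a key; it is reserved for indexing.
    }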

evaluate(*, overwrite: bool = False) bool[source]

Evaluate the specified dataset.

Parameters

overwritebool, default=False

Whether to overwrite the data in save_path. If False, the evaluation will be built upon the existing data in save_path; otherwise all data will be re-evaluated and the existing data will be overwritten.

Returns

completedbool

Whether the task has been completed.
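A short usage sketch, assuming evaluator is an instance of a concrete subclass:

# Build on whatever already exists at save_path (overwrite=False by default).
completed = evaluator.evaluate()
if not completed:
    # Calling again continues from the partial results instead of
    # re-evaluating everything.
    completed = evaluator.evaluate()

# To discard existing results at save_path and start over:
# evaluator.evaluate(overwrite=True)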

load_avg_score(subject_subset: list | None = None, items: list | None = None) dict[str, numbers.Real][source]

Load the average score of each subject from the save location.

Parameters

subject_subsetlist or None

The subjects of the scores to select, i.e., the columns. If None, select all subjects.

itemslist or None

The indices of the items to select. If None, select all items. This will be applied after subject_subset. This does not necessarily need to be a subset of the index of the loaded pd.DataFrame. However, any item out of range would not be taken into account when computing the average score.

Returns

avg_scoredict

The average score loaded from save_path.

Examples

Suppose that the file at save_path looks like the following:

i,score1,score2
0,78,83
1,64,76
2,100,92
3,28,38
4,30,45
>>> evaluator.load_avg_score()  
{'score1': 60.0, 'score2': 66.8}
>>> evaluator.load_avg_score(subject_subset=["score2"])  
{'score2': 66.8}
>>> evaluator.load_avg_score(items=list(range(7)))  
{'score1': 60.0, 'score2': 66.8}
load_scores(subject_subset: list | None = None, items: list | None = None) DataFrame[source]

Load the scores from the save location.

Parameters

subject_subsetlist or None

The subjects of the scores to select, i.e., the columns. If None, select all subjects. In the returned pd.DataFrame, the columns will be in the same order as subject_subset.

itemslist or None

The indices of the items to select. If None, select all items. This will be applied after subject_subset. This does not necessarily need to be a subset of the index of the loaded pd.DataFrame. The indices that do not exist in the index of the loaded pd.DataFrame will be assigned NaN. In the returned pd.DataFrame, the rows will be in the same order as items.

Returns

scorespandas.DataFrame

The scores loaded from save_path.

Examples

Suppose that the file at save_path looks like the following:

i,score1,score2
0,78,83
1,64,76
3,100,92
4,28,38
5,30,45
>>> evaluator.load_scores()  
   score1  score2
i
0      78      83
1      64      76
3     100      92
4      28      38
5      30      45
>>> evaluator.load_scores(subject_subset=["score2"])  
   score2
i
0      83
1      76
3      92
4      38
5      45
>>> evaluator.load_scores(items=list(range(7)))  
   score1  score2
i
0    78.0    83.0
1    64.0    76.0
2     NaN     NaN
3   100.0    92.0
4    28.0    38.0
5    30.0    45.0
6     NaN     NaN
class ml3m.base.BaseOpenAIEvaluator(dataset: str | Path, save_path: str | Path, subjects: list, openai_config: str | Path, *, fmt: DatasetFormat = 'jsonl', n_iter: int = 1, agg_method: AggregateMethod | None = None, timeout: float = 60, model: str = 'gpt-3.5-turbo', logging_mode: LoggingMode = 'all', verbose: int = 0, **openai_kwargs)[source]

Bases: BaseEvaluator

Base evaluator class via OpenAI.

Note

This class is meant to be subclassed. The methods that must be overridden include:

BaseOpenAIEvaluator._prompt()

BaseOpenAIEvaluator._extract_scores()

Parameters

datasetstr or pathlib.Path

The absolute path to the evaluation dataset.

save_pathstr or pathlib.Path

The absolute path to the save location. This path may or may not exist, and if it exists, its file contents will be treated as a (partially) written result. Whether to overwrite the existing results or to build on them depends on overwrite when using the BaseOpenAIEvaluator.evaluate() method.

subjectslist

The subjects to evaluate. This should strictly correspond to how the scores are obtained in BaseOpenAIEvaluator._extract_scores(). If the score is obtained as a single real value, subjects must be a list of one element, and that element will be used as the name of that score. If the score(s) are obtained as a dictionary of subject-value pairs, all items in subjects must appear among the keys of the obtained dictionary. Any additional key will be discarded; any missing key will be treated as an error.

openai_configstr or pathlib.Path

The absolute path to the OpenAI configuration file.

fmt{“jsonl”, “json”, “csv”}, default=”jsonl”

The format of dataset.

n_iterint, default=1

The number of iterations for each data item. This is commonly used when one round of scoring is not convincing enough and the final scoring should be some statistics of multiple rounds of scoring.

agg_method{“mean”, “sum”, “min”, “max”, “mode”}, default=None

The aggregate method to use on multiple rounds of scoring. Ignored when n_iter=1, and otherwise this must not be None.

timeoutfloat, default=60

The timeout in seconds. This is not the OpenAI timeout, but the timeout for cancelling the worker tasks.

modelstr, default=”gpt-3.5-turbo”

The ID of the model to use, must be one of the available OpenAI models that support the ChatCompletion API. See also https://platform.openai.com/docs/models/model-endpoint-compatibility.

logging_mode{“all”, “failed”, “none”}, default=”all”

The logging mode, whether to save the logs of all items, or only of failed items, or save no log.

verboseint, default=0

The verbosity level of the processing. For negative levels, only a progress bar will be displayed. For level 0, the errored items will also be displayed. For positive levels, all items will be displayed, and the verbosity level determines the number of lines to display for the message of each item.

openai_kwargs

The additional keyword arguments to pass to OpenAI ChatCompletion. See the arguments marked as Optional at https://platform.openai.com/docs/api-reference/chat/create. Do not try to pass api_key and api_base through openai_kwargs, use a configuration file instead.

_extract_scores(reply: str, data_item: DataItemType) Real | dict[Any, Real][source]

Extract the score(s) from the OpenAI model reply (and the data item).

Note

This method is not implemented and must be overridden in subclasses.

Parameters

replystr

The OpenAI model reply, from which the score(s) will be extracted.

data_itemDataItemType

The data item. This may not be used, but in case the OpenAI model reply requires comparison with the data item to give the final score, the data item is passed in as well.

Returns

scoresreal or dict

The extracted scores, either a single score or a dictionary of subject-score pairs.

Notes

This method should correspond to the BaseOpenAIEvaluator._prompt() method, in the sense that the formatted evaluation prompt is expected to elicit a model reply from which this method can extract the score(s). It can extract either a single score or a dictionary of subject-score pairs.

Note that if there are multiple scores obtained, i.e., returning a dictionary, remember that the “i” key cannot be included since it is reserved for indexing.

It is recommended not to catch the exceptions that cause the extraction of scores to fail, since otherwise BaseOpenAIEvaluator.evaluate() will not realize that the data item errors out.
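For instance, if BaseOpenAIEvaluator._prompt() asks the model to reply with one "subject: score" pair per line, a matching override might look like the following sketch. The reply format is an assumption made by that particular prompt, not something enforced by the class:

import re

def _extract_scores(self, reply, data_item):
    # Expect lines such as "accuracy: 4" in the model reply. A reply that
    # matches nothing raises ValueError, which is deliberately not caught
    # so that evaluate() can mark the item as failed.
    scores = {}
    for subject, value in re.findall(r"(\w+)\s*:\s*(\d+(?:\.\d+)?)", reply):
        scores[subject.lower()] = float(value)
    if not scores:
        raise ValueError(f"Cannot extract scores from reply: {reply!r}")
    return scores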

_prompt(data_item: DataItemType) tuple[str, str][source]

Return the prompt for evaluation.

Note

This method is not implemented and must be overridden in subclasses.

Parameters

data_itemDataItemType

The data item.

Returns

sys_msgstr

The system message for setting the role of the OpenAI model when querying for evaluation, e.g. a professional teacher in some field. If no system message is needed, this should be an empty string. See also https://platform.openai.com/docs/guides/gpt/chat-completions-api for an example of system message.

eval_promptstr

The formatted evaluation prompt.
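A minimal sketch of an override that pairs with the _extract_scores() sketch above; the data item field names and the requested reply format are assumptions for illustration:

def _prompt(self, data_item):
    # "question" and "response" are hypothetical field names.
    question, answer = data_item["question"], data_item["response"]
    sys_msg = "You are a strict grader of question answering."
    eval_prompt = (
        f"Question:\n{question}\n\nAnswer:\n{answer}\n\n"
        "Rate the answer on accuracy and fluency, each from 1 to 5. Reply "
        "with one line per subject, e.g. 'accuracy: 4'."
    )
    return sys_msg, eval_prompt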

evaluate(*, overwrite: bool = False) bool

Evaluate the specified dataset.

Parameters

overwritebool, default=False

Whether to overwrite the data in save_path. If False, the evaluation will be built upon the existing data in save_path; otherwise all data will be re-evaluated and the existing data will be overwritten.

Returns

completedbool

Whether the task has been completed.

load_avg_score(subject_subset: list | None = None, items: list | None = None) dict[str, numbers.Real]

Load the average score of each subject from the save location.

Parameters

subject_subsetlist or None

The subjects of the scores to select, i.e., the columns. If None, select all subjects.

itemslist or None

The indices of the items to select. If None, select all items. This will be applied after subject_subset. This does not necessarily need to be a subset of the index of the loaded pd.DataFrame. However, any item out of range would not be taken into account when computing the average score.

Returns

avg_scoredict

The average score loaded from save_path.

Examples

Suppose that the file at save_path looks like the following:

i,score1,score2
0,78,83
1,64,76
2,100,92
3,28,38
4,30,45
>>> evaluator.load_avg_score()  
{'score1': 60.0, 'score2': 66.8}
>>> evaluator.load_avg_score(subject_subset=["score2"])  
{'score2': 66.8}
>>> evaluator.load_avg_score(items=list(range(7)))  
{'score1': 60.0, 'score2': 66.8}
load_scores(subject_subset: list | None = None, items: list | None = None) DataFrame

Load the scores from the save location.

Parameters

subject_subsetlist or None

The subjects of the scores to select, i.e., the columns. If None, select all subjects. In the returned pd.DataFrame, the columns will be in the same order as subject_subset.

itemslist or None

The indices of the items to select. If None, select all items. This will be applied after subject_subset. This does not necessarily need to be a subset of the index of the loaded pd.DataFrame. The indices that do not exist in the index of the loaded pd.DataFrame will be assigned NaN. In the returned pd.DataFrame, the rows will be in the same order as items.

Returns

scorespandas.DataFrame

The scores loaded from save_path.

Examples

Suppose that the file at save_path looks like the following:

i,score1,score2
0,78,83
1,64,76
3,100,92
4,28,38
5,30,45
>>> evaluator.load_scores()  
   score1  score2
i
0      78      83
1      64      76
3     100      92
4      28      38
5      30      45
>>> evaluator.load_scores(subject_subset=["score2"])  
   score2
i
0      83
1      76
3      92
4      38
5      45
>>> evaluator.load_scores(items=list(range(7)))  
   score1  score2
i
0    78.0    83.0
1    64.0    76.0
2     NaN     NaN
3   100.0    92.0
4    28.0    38.0
5    30.0    45.0
6     NaN     NaN
class ml3m.base.ResponseGenerator(orig_dataset: str | Path, dataset: str | Path, info_func: Callable[[DataItemType], Any], query_func: Callable[[Any], str], response_name: str, *, fmt: DatasetFormat = 'jsonl', n_workers: int = 1, logging_mode: LoggingMode = 'all', verbose: int = 0)[source]

Bases: object

Generate responses and combine with the original dataset.

Parameters

orig_datasetstr or pathlib.Path

The absolute path to the original dataset.

datasetstr or pathlib.Path

The absolute path to the result dataset. All information in the original dataset will be preserved while the responses will be appended.

info_funcCallable

The function that takes a data item and forms the query. The data item can be a pandas.Series, a list, or a dictionary, depending on fmt. Whatever it returns will be passed as the input to query_func and printed to the console at high verbosity levels.

query_funcCallable

The function that takes the query returned by info_func and outputs the model response represented as a single string. This function should be synchronous if n_workers=1 and asynchronous otherwise.

response_namestr

The key or column name to use for the response. This should not be a key or column name that already exists in the dataset. Be extremely careful, since no warning or exception will be raised for this.

fmt{“jsonl”, “json”, “csv”}, default=”jsonl”

The format of dataset.

n_workersint, default=1

The number of workers. If only one worker is specified, the dataset will be processed sequentially; otherwise it will be asynchronously parallelized with the specified number of workers.

logging_mode{“all”, “failed”, “none”}, default=”all”

The logging mode, whether to save the logs of all items, or only of failed items, or save no log.

verboseint, default=0

The verbosity level of the processing. For negative levels, only a progress bar will be displayed. For level 0, the errored items will also be displayed. For positive levels, all items will be displayed, and the verbosity level determines the number of lines to display for the message of each item.

generate(*, overwrite: bool = False) bool[source]

Generate responses and combine with the original dataset.

Parameters

overwritebool, default=False

Whether to overwrite the responses if some already exist, specified by response_name.

Returns

completedbool

Whether the task has been completed.
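A minimal usage sketch, assuming fmt="jsonl" so that each data item is a dictionary; the field names and the model_answer function are hypothetical stand-ins:

from ml3m.base import ResponseGenerator


def model_answer(query):
    # Hypothetical stand-in for the actual model call. It is synchronous
    # because n_workers=1 below; use an async function when n_workers > 1.
    return f"Echo: {query}"


generator = ResponseGenerator(
    orig_dataset="/abs/path/to/orig.jsonl",
    dataset="/abs/path/to/with_responses.jsonl",
    info_func=lambda data_item: data_item["instruction"],  # hypothetical key
    query_func=model_answer,
    response_name="response",
    fmt="jsonl",
    n_workers=1,
)
completed = generator.generate()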