ml3m.base: Base Classes
- class ml3m.base.BaseEvaluator(dataset: str | Path, save_path: str | Path, subjects: list, *, fmt: DatasetFormat = 'jsonl', workers: int | list[dict] = 1, n_iter: int = 1, agg_method: AggregateMethod | None = None, logging_mode: LoggingMode = 'all', verbose: int = 0)[source]
Bases: object
Base evaluator class.
Note
This class is meant to be subclassed. The methods that must be overridden include:
- BaseEvaluator._aget_score() (if to be used with multiple workers)
- BaseEvaluator._get_score() (if to be used with a single worker)
Parameters
- dataset : str or pathlib.Path
The absolute path to the evaluation dataset.
- save_path : str or pathlib.Path
The absolute path to the save location. This path may or may not exist; if it exists, its file contents will be treated as a (partially) written result. Whether to overwrite the existing results or to build on them depends on overwrite when using the BaseEvaluator.evaluate() method.
- subjects : list
The subjects to evaluate. This should strictly correspond to how the scores are obtained in BaseEvaluator._aget_score(). If the score is obtained as a single real value, subjects must be a list of one element, and that element will be used as the name of that score. If the score(s) are obtained as a dictionary of subject-value pairs, all items in subjects must appear as keys of the obtained dictionary. Any additional key will be discarded; any missing key will be treated as an error.
- fmt : {"jsonl", "json", "csv"}, default="jsonl"
The format of dataset.
- workers : int or list of dict, default=1
If workers is an integer, it will be treated as the number of workers. With a single worker, the dataset will be processed sequentially; with multiple workers, the processing will be asynchronously parallelized. If workers is a list of dictionaries, the length of the list will be treated as the number of workers, and each dictionary will be passed to BaseEvaluator._aget_score() as additional keyword arguments (see the sketch after this parameter list). Note that if workers is an integer, no additional keyword arguments will be passed.
- n_iter : int, default=1
The number of iterations for each data item. This is commonly used when one round of scoring is not convincing enough and the final score should be a statistic of multiple rounds of scoring.
- agg_method : {"mean", "sum", "min", "max", "mode"}, default=None
The aggregate method to use on multiple rounds of scoring. Ignored when n_iter=1; otherwise this must not be None.
- logging_mode : {"all", "failed", "none"}, default="all"
The logging mode: save the logs of all items, only of failed items, or save no log at all.
- verbose : int, default=0
The verbosity level of the processing. For negative levels, only a progress bar will be displayed. At level 0, errored items will also be displayed. For positive levels, all items will be displayed, and the verbosity level determines the number of lines to display for the message of each item.
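For illustration, a minimal sketch of constructing an evaluator with per-worker keyword arguments. MyEvaluator is a hypothetical concrete subclass, and the paths and keys are placeholders; each dictionary in workers is forwarded to BaseEvaluator._aget_score() as keyword arguments:

    evaluator = MyEvaluator(  # hypothetical concrete subclass
        dataset="/abs/path/to/dataset.jsonl",
        save_path="/abs/path/to/scores.csv",
        subjects=["accuracy", "fluency"],
        # Two workers; each dict is passed to _aget_score as **kwargs
        workers=[{"api_key": "KEY_1"}, {"api_key": "KEY_2"}],
        n_iter=3,
        agg_method="mean",
    )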
- async _aget_score(data_item: DataItemType, **kwargs) → Real | dict[Any, Real] [source]
Evaluate a data item and obtain its score(s).
This should be the asynchronous version of BaseEvaluator._get_score(). See BaseEvaluator._get_score() for details. If the process of obtaining the scores does not involve anything asynchronous, this can simply be overridden as follows:

    async def _aget_score(self, data_item, **kwargs):
        return self._get_score(data_item, **kwargs)

However, note that this will cause only one worker to actually do all the tasks, because it never gives away control of the event loop.
Note
This method is not implemented and must be overridden in subclasses. Moreover, this method must be defined as asynchronous.
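For illustration, a minimal sketch of a genuinely asynchronous override. The scoring call is simulated with asyncio.sleep (a real implementation would await e.g. an HTTP request to a scoring service), and the subject names are assumptions:

    import asyncio
    import random

    from ml3m.base import BaseEvaluator

    class MyAsyncEvaluator(BaseEvaluator):
        async def _aget_score(self, data_item, **kwargs):
            # Awaiting a real asynchronous operation (simulated here by
            # asyncio.sleep) yields control of the event loop, so that
            # multiple workers can make progress concurrently.
            await asyncio.sleep(random.uniform(0.1, 0.5))
            # The keys must match the subjects passed to the constructor.
            return {"accuracy": 80, "fluency": 90}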
- _get_score(data_item: DataItemType, **kwargs) → Real | dict[Any, Real] [source]
Evaluate a data item and obtain its score(s).
Note
This method is not implemented and must be overridden in subclasses.
Parameters
- data_item : DataItemType
The data item.
- kwargs
The additional keyword arguments.
Returns
- scores : real or dict
The evaluated scores, either a single score or a dictionary of subject-score pairs.
Notes
Note that if multiple scores are obtained, i.e., a dictionary is returned, the “i” key cannot be included since it is reserved for indexing.
Moreover, it is recommended not to catch the exceptions that cause the processing of a data item to fail, since otherwise BaseEvaluator.evaluate() will not realize that the data item has errored out.
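For illustration, a minimal synchronous sketch. The "response" and "answer" fields are assumptions about the dataset, not part of ml3m:

    from ml3m.base import BaseEvaluator

    class ExactMatchEvaluator(BaseEvaluator):
        # Intended to be constructed with subjects=["exact_match"]
        def _get_score(self, data_item, **kwargs):
            # Assumes "jsonl" data items that are dictionaries with
            # "response" and "answer" fields (hypothetical names).
            score = 100 if data_item["response"] == data_item["answer"] else 0
            # A single real value would also work here since there is
            # only one subject; a dictionary keyed by the subjects is
            # shown for the general case.
            return {"exact_match": score}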
- evaluate(*, overwrite: bool = False) → bool [source]
Evaluate the specified dataset.
Parameters
- overwrite : bool, default=False
Whether to overwrite the data in save_path. If False, the evaluation will be built upon the existing data in save_path; otherwise all data will be evaluated and the existing data will be overwritten.
Returns
- completed : bool
Whether the task has been completed.
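Since the return value indicates whether all items have been scored, a common pattern (a sketch; evaluator is assumed to be an instance of a concrete subclass) is to rerun on incompletion:

    completed = evaluator.evaluate()
    if not completed:
        # Building on the partial results in save_path, so that only
        # the items without scores are re-evaluated on this pass.
        completed = evaluator.evaluate(overwrite=False)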
- load_avg_score(subject_subset: list | None = None, items: list | None = None) → dict[str, numbers.Real] [source]
Load the average score of each subject from the save location.
Parameters
- subject_subset : list or None
The subjects of the scores to select, i.e., the columns. If None, select all subjects.
- items : list or None
The indices of the items to select. If None, select all items. This will be applied after subject_subset. This does not necessarily need to be a subset of the index of the loaded pd.DataFrame. However, any item out of range would not be taken into account when computing the average score.
Returns
- avg_score : dict
The average score of each subject, loaded from save_path.
Examples
Suppose that the file at save_path looks like the following:

    i,score1,score2
    0,78,83
    1,64,76
    2,100,92
    3,28,38
    4,30,45
>>> evaluator.load_avg_score()
{'score1': 60.0, 'score2': 66.8}
>>> evaluator.load_avg_score(subject_subset=["score2"])
{'score2': 66.8}
>>> evaluator.load_avg_score(items=list(range(7)))
{'score1': 60.0, 'score2': 66.8}
- load_scores(subject_subset: list | None = None, items: list | None = None) → DataFrame [source]
Load the scores from the save location.
Parameters
- subject_subset : list or None
The subjects of the scores to select, i.e., the columns. If None, select all subjects. In the returned pd.DataFrame, the columns will be in the same order as subject_subset.
- items : list or None
The indices of the items to select. If None, select all items. This will be applied after subject_subset. This does not necessarily need to be a subset of the index of the loaded pd.DataFrame. The indices that do not exist in the index of the loaded pd.DataFrame will be assigned NaN. In the returned pd.DataFrame, the rows will be in the same order as items.
Returns
- scores : pandas.DataFrame
The scores loaded from save_path.
Examples
Suppose that the file at save_path looks like the following:

    i,score1,score2
    0,78,83
    1,64,76
    3,100,92
    4,28,38
    5,30,45
>>> evaluator.load_scores()
   score1  score2
i
0      78      83
1      64      76
3     100      92
4      28      38
5      30      45
>>> evaluator.load_scores(subject_subset=["score2"])
   score2
i
0      83
1      76
3      92
4      38
5      45
>>> evaluator.load_scores(items=list(range(7)))
   score1  score2
i
0    78.0    83.0
1    64.0    76.0
2     NaN     NaN
3   100.0    92.0
4    28.0    38.0
5    30.0    45.0
6     NaN     NaN
- class ml3m.base.BaseOpenAIEvaluator(dataset: str | Path, save_path: str | Path, subjects: list, openai_config: str | Path, *, fmt: DatasetFormat = 'jsonl', n_iter: int = 1, agg_method: AggregateMethod | None = None, timeout: float = 60, model: str = 'gpt-3.5-turbo', logging_mode: LoggingMode = 'all', verbose: int = 0, **openai_kwargs)[source]
Bases: BaseEvaluator
Base evaluator class via OpenAI.
Note
This class is meant to be subclassed. The methods that must be overridden include:
- BaseOpenAIEvaluator._prompt()
- BaseOpenAIEvaluator._extract_scores()
Parameters
- dataset : str or pathlib.Path
The absolute path to the evaluation dataset.
- save_path : str or pathlib.Path
The absolute path to the save location. This path may or may not exist; if it exists, its file contents will be treated as a (partially) written result. Whether to overwrite the existing results or to build on them depends on overwrite when using the BaseOpenAIEvaluator.evaluate() method.
- subjects : list
The subjects to evaluate. This should strictly correspond to how the scores are obtained in BaseOpenAIEvaluator._extract_scores(). If the score is obtained as a single real value, subjects must be a list of one element, and that element will be used as the name of that score. If the score(s) are obtained as a dictionary of subject-value pairs, all items in subjects must appear as keys of the obtained dictionary. Any additional key will be discarded; any missing key will be treated as an error.
- openai_config : str or pathlib.Path
The absolute path to the OpenAI configuration file.
- fmt : {"jsonl", "json", "csv"}, default="jsonl"
The format of dataset.
- n_iter : int, default=1
The number of iterations for each data item. This is commonly used when one round of scoring is not convincing enough and the final score should be a statistic of multiple rounds of scoring.
- agg_method : {"mean", "sum", "min", "max", "mode"}, default=None
The aggregate method to use on multiple rounds of scoring. Ignored when n_iter=1; otherwise this must not be None.
- timeout : float, default=60
The timeout in seconds. This is not the OpenAI timeout, but the timeout for cancelling the worker tasks.
- model : str, default="gpt-3.5-turbo"
The ID of the model to use; it must be one of the available OpenAI models that support the ChatCompletion API. See also https://platform.openai.com/docs/models/model-endpoint-compatibility.
- logging_mode : {"all", "failed", "none"}, default="all"
The logging mode: save the logs of all items, only of failed items, or save no log at all.
- verbose : int, default=0
The verbosity level of the processing. For negative levels, only a progress bar will be displayed. At level 0, errored items will also be displayed. For positive levels, all items will be displayed, and the verbosity level determines the number of lines to display for the message of each item.
- openai_kwargs
The additional keyword arguments to pass to the OpenAI ChatCompletion API. See the arguments marked as Optional at https://platform.openai.com/docs/api-reference/chat/create. Do not try to pass api_key and api_base through openai_kwargs; use a configuration file instead.
- _extract_scores(reply: str, data_item: DataItemType) → Real | dict[Any, Real] [source]
Extract the score(s) from the OpenAI model reply (and the data item).
Note
This method is not implemented and must be overridden in subclasses.
Parameters
- reply : str
The OpenAI model reply, from which the score(s) will be extracted.
- data_item : DataItemType
The data item. This may not be used, but it is passed in as well in case the OpenAI model reply requires comparison with the data item to give the final score.
Returns
- scores : real or dict
The extracted scores, either a single score or a dictionary of subject-score pairs.
Notes
This method should correspond to the BaseOpenAIEvaluator._prompt() method, in the sense that the formatted evaluation prompt is expected to invoke an extractable model reply, and this method should extract the score(s) from that reply. It can extract either a single score or a dictionary of subject-score pairs.
Note that if multiple scores are obtained, i.e., a dictionary is returned, the “i” key cannot be included since it is reserved for indexing.
It is recommended not to catch the exceptions that cause the extraction of scores to fail, since otherwise BaseOpenAIEvaluator.evaluate() will not realize that the data item has errored out.
- _prompt(data_item: DataItemType) → tuple[str, str] [source]
Return the prompt for evaluation.
Note
This method is not implemented and must be overridden in subclasses.
Parameters
- data_item : DataItemType
The data item.
Returns
- sys_msg : str
The system message for setting the role of the OpenAI model when querying for evaluation, e.g., a professional teacher in some field. If no system message is needed, this should be an empty string. See also https://platform.openai.com/docs/guides/gpt/chat-completions-api for an example of a system message.
- eval_prompt : str
The formatted evaluation prompt.
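Putting BaseOpenAIEvaluator._prompt() and BaseOpenAIEvaluator._extract_scores() together, a hedged sketch of a concrete subclass. The data item fields, the requested reply format, and the regular expression are all assumptions for illustration:

    import re

    from ml3m.base import BaseOpenAIEvaluator

    class QAAccuracyEvaluator(BaseOpenAIEvaluator):
        # Intended to be constructed with subjects=["accuracy"]

        def _prompt(self, data_item):
            # Assumes data items with "question", "answer", and
            # "response" fields (hypothetical names).
            sys_msg = "You are a professional teacher grading student answers."
            eval_prompt = (
                f"Question: {data_item['question']}\n"
                f"Reference answer: {data_item['answer']}\n"
                f"Student answer: {data_item['response']}\n"
                "Rate the accuracy of the student answer on a scale of "
                "0 to 100, replying in the exact form 'Accuracy: <score>'."
            )
            return sys_msg, eval_prompt

        def _extract_scores(self, reply, data_item):
            # Parse the reply format requested by _prompt(). If the
            # reply is malformed, the AttributeError from .group()
            # propagates, so evaluate() marks the item as failed.
            match = re.search(r"Accuracy:\s*(\d+(?:\.\d+)?)", reply)
            return {"accuracy": float(match.group(1))}

It could then be instantiated and run as follows; the paths are placeholders, and temperature is forwarded to ChatCompletion via openai_kwargs:

    evaluator = QAAccuracyEvaluator(
        dataset="/abs/path/to/dataset.jsonl",
        save_path="/abs/path/to/scores.csv",
        subjects=["accuracy"],
        openai_config="/abs/path/to/openai_config.json",
        temperature=0.0,
    )
    completed = evaluator.evaluate()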
- evaluate(*, overwrite: bool = False) → bool
Evaluate the specified dataset.
Parameters
- overwrite : bool, default=False
Whether to overwrite the data in save_path. If False, the evaluation will be built upon the existing data in save_path; otherwise all data will be evaluated and the existing data will be overwritten.
Returns
- completed : bool
Whether the task has been completed.
- load_avg_score(subject_subset: list | None = None, items: list | None = None) → dict[str, numbers.Real]
Load the average score of each subject from the save location.
Parameters
- subject_subset : list or None
The subjects of the scores to select, i.e., the columns. If None, select all subjects.
- items : list or None
The indices of the items to select. If None, select all items. This will be applied after subject_subset. This does not necessarily need to be a subset of the index of the loaded pd.DataFrame. However, any item out of range would not be taken into account when computing the average score.
Returns
- avg_score : dict
The average score of each subject, loaded from save_path.
Examples
Suppose that the file at save_path looks like the following:

    i,score1,score2
    0,78,83
    1,64,76
    2,100,92
    3,28,38
    4,30,45
>>> evaluator.load_avg_score()
{'score1': 60.0, 'score2': 66.8}
>>> evaluator.load_avg_score(subject_subset=["score2"])
{'score2': 66.8}
>>> evaluator.load_avg_score(items=list(range(7)))
{'score1': 60.0, 'score2': 66.8}
- load_scores(subject_subset: list | None = None, items: list | None = None) → DataFrame
Load the scores from the save location.
Parameters
- subject_subset : list or None
The subjects of the scores to select, i.e., the columns. If None, select all subjects. In the returned pd.DataFrame, the columns will be in the same order as subject_subset.
- items : list or None
The indices of the items to select. If None, select all items. This will be applied after subject_subset. This does not necessarily need to be a subset of the index of the loaded pd.DataFrame. The indices that do not exist in the index of the loaded pd.DataFrame will be assigned NaN. In the returned pd.DataFrame, the rows will be in the same order as items.
Returns
- scores : pandas.DataFrame
The scores loaded from save_path.
Examples
Suppose that the file at save_path looks like the following:

    i,score1,score2
    0,78,83
    1,64,76
    3,100,92
    4,28,38
    5,30,45
>>> evaluator.load_scores()
   score1  score2
i
0      78      83
1      64      76
3     100      92
4      28      38
5      30      45
>>> evaluator.load_scores(subject_subset=["score2"])
   score2
i
0      83
1      76
3      92
4      38
5      45
>>> evaluator.load_scores(items=list(range(7)))
   score1  score2
i
0    78.0    83.0
1    64.0    76.0
2     NaN     NaN
3   100.0    92.0
4    28.0    38.0
5    30.0    45.0
6     NaN     NaN
- class ml3m.base.ResponseGenerator(orig_dataset: str | Path, dataset: str | Path, info_func: Callable[[DataItemType], Any], query_func: Callable[[Any], str], response_name: str, *, fmt: DatasetFormat = 'jsonl', n_workers: int = 1, logging_mode: LoggingMode = 'all', verbose: int = 0)[source]
Bases: object
Generate responses and combine with the original dataset.
Parameters
- orig_dataset : str or pathlib.Path
The absolute path to the original dataset.
- dataset : str or pathlib.Path
The absolute path to the result dataset. All information in the original dataset will be preserved, and the responses will be appended.
- info_func : Callable
The function that takes a data item and forms the query. The data item can be a pandas.Series, a list, or a dictionary, depending on fmt. Whatever it returns will be passed as the input to query_func, and printed to the console at high verbosity levels.
- query_func : Callable
The function that takes the query returned by info_func and outputs the model response represented as a single string. This function should be synchronous if n_workers=1 and asynchronous otherwise.
- response_name : str
The key or column name to use for the response. This should not be a key or column name that already exists in the dataset. Be extremely careful, since no warning or exception will be raised if it does.
- fmt : {"jsonl", "json", "csv"}, default="jsonl"
The format of dataset.
- n_workers : int, default=1
The number of workers. With only one worker, the dataset will be processed sequentially; otherwise it will be asynchronously parallelized with the specified number of workers.
- logging_mode : {"all", "failed", "none"}, default="all"
The logging mode: save the logs of all items, only of failed items, or save no log at all.
- verbose : int, default=0
The verbosity level of the processing. For negative levels, only a progress bar will be displayed. At level 0, errored items will also be displayed. For positive levels, all items will be displayed, and the verbosity level determines the number of lines to display for the message of each item.
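For illustration, a hedged single-worker sketch. The "instruction" field and the echoing query_func are stand-ins for a real dataset and model, and the generate() entry point is an assumption (by analogy with BaseEvaluator.evaluate()), since this section only documents the constructor:

    from ml3m.base import ResponseGenerator

    def info_func(data_item):
        # Assumes "jsonl" data items that are dictionaries with an
        # "instruction" field (a hypothetical name).
        return data_item["instruction"]

    def query_func(query):
        # Stand-in for a real model call; with n_workers=1 this
        # function should be synchronous and return a single string.
        return f"(model response to: {query})"

    generator = ResponseGenerator(
        orig_dataset="/abs/path/to/orig_dataset.jsonl",
        dataset="/abs/path/to/dataset_with_responses.jsonl",
        info_func=info_func,
        query_func=query_func,
        response_name="response",
    )
    completed = generator.generate()  # assumed entry point, see above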