experiments#
phoenix.experiments#
- evaluate_experiment(experiment, evaluators, *, dry_run=False, print_summary=True, rate_limit_errors=None)#
- run_experiment(dataset, task, evaluators=None, *, experiment_name=None, experiment_description=None, experiment_metadata=None, rate_limit_errors=None, dry_run=False, print_summary=True)#
Runs an experiment on a given dataset of examples.
An experiment is a user-defined task that runs on each example in a dataset. The results from each experiment can be evaluated using any number of evaluators to measure the behavior of the task. The experiment and evaluation results are stored in the Phoenix database for comparison and analysis.
A task is either a synchronous or asynchronous function that returns a JSON serializable output. If the task is a function of one argument then that argument will be bound to the input field of the dataset example. Alternatively, the task can be a function of any combination of specific argument names that will be bound to special values:
- input: The input field of the dataset example
- expected: The expected or reference output of the dataset example
- reference: An alias for expected
- metadata: Metadata associated with the dataset example
- example: The dataset Example object with all associated fields
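For instance, both of the following are valid tasks (a minimal sketch; the function bodies are illustrative assumptions, only the signatures follow the binding rules above):

```python
def answer_question(input):
    # A single-parameter task: Phoenix binds the example's input field here.
    return f"echo: {input}"

async def answer_with_context(input, expected, metadata):
    # Named parameters are bound to the corresponding special values above.
    return {"answer": str(input), "has_reference": expected is not None}
```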
An evaluator is either a synchronous or asynchronous function that returns either a boolean or numeric “score”. If the evaluator is a function of one argument then that argument will be bound to the output of the task. Alternatively, the evaluator can be a function of any combination of specific argument names that will be bound to special values:
- input: The input field of the dataset example
- output: The output of the task
- expected: The expected or reference output of the dataset example
- reference: An alias for expected
- metadata: Metadata associated with the dataset example
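For instance (a minimal sketch; the scoring logic is an illustrative assumption):

```python
def exact_match(output, expected):
    # Boolean score: does the task output equal the reference output?
    return output == expected

def length_score(output):
    # A single-parameter evaluator receives the task's output.
    return min(len(str(output)) / 100, 1.0)
```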
Phoenix also provides pre-built evaluators in the phoenix.experiments.evaluators module.
- Parameters:
dataset (Dataset) – The dataset on which to run the experiment.
task (ExperimentTask) – The task to run on each example in the dataset.
evaluators (Optional[Evaluators]) – A single evaluator or sequence of evaluators used to evaluate the results of the experiment. Defaults to None.
experiment_name (Optional[str]) – The name of the experiment. Defaults to None.
experiment_description (Optional[str]) – A description of the experiment. Defaults to None.
experiment_metadata (Optional[Mapping[str, Any]]) – Metadata to associate with the experiment. Defaults to None.
rate_limit_errors (Optional[BaseException | Sequence[BaseException]]) – An exception or sequence of exceptions to adaptively throttle on. Defaults to None.
dry_run (bool | int) – Run the experiment in dry-run mode. When set, experiment results will not be recorded in Phoenix. If True, the experiment will run on a single random dataset example. If an integer, the experiment will run on a random sample of the dataset examples of the given size. Defaults to False.
print_summary (bool) – Whether to print a summary of the experiment and evaluation results. Defaults to True.
- Returns:
The results of the experiment and evaluation. Additional evaluations can be added to the experiment using the evaluate_experiment function.
- Return type:
RanExperiment
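A minimal end-to-end sketch; the dataset name, the task and evaluator logic, and the px.Client().get_dataset call are illustrative assumptions:

```python
import phoenix as px
from phoenix.experiments import evaluate_experiment, run_experiment

# Assumption: a dataset named "qa-examples" was previously uploaded to Phoenix.
dataset = px.Client().get_dataset(name="qa-examples")

def task(input):
    return f"answer for: {input}"

def exact_match(output, expected):
    return output == expected

experiment = run_experiment(
    dataset,
    task,
    evaluators=[exact_match],
    experiment_name="baseline",
)

# Attach an additional evaluation to the returned RanExperiment.
def nonempty(output):
    return bool(str(output).strip())

evaluate_experiment(experiment, evaluators=[nonempty])
```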
phoenix.experiments.evaluators#
code_evaluators#
- class ContainsAllKeywords(*args, **kwargs)#
Bases:
CodeEvaluator
- class ContainsAnyKeyword(*args, **kwargs)#
Bases:
CodeEvaluator
- class ContainsKeyword(*args, **kwargs)#
Bases:
CodeEvaluator
- class JSONParsable(*args, **kwargs)#
Bases:
CodeEvaluator
- class MatchesRegex(*args, **kwargs)#
Bases:
CodeEvaluator
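These evaluators are passed to run_experiment or evaluate_experiment through the evaluators argument. A hedged sketch (the constructor arguments are assumptions inferred from the class names, since the signatures above list only *args and **kwargs):

```python
from phoenix.experiments.evaluators import (
    ContainsKeyword,
    JSONParsable,
    MatchesRegex,
)

# Constructor arguments below are assumptions; the reference signatures
# above list only (*args, **kwargs).
evaluators = [
    ContainsKeyword(keyword="phoenix"),
    MatchesRegex(pattern=r"\d{4}-\d{2}-\d{2}"),
    JSONParsable(),
]
```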
llm_evaluators#
- class CoherenceEvaluator(*args, **kwargs)#
Bases:
LLMCriteriaEvaluator
- class ConcisenessEvaluator(*args, **kwargs)#
Bases:
LLMCriteriaEvaluator
- class HelpfulnessEvaluator(*args, **kwargs)#
Bases:
LLMCriteriaEvaluator
- class LLMCriteriaEvaluator(*args, **kwargs)#
Bases:
LLMEvaluator
- class RelevanceEvaluator(*args, **kwargs)#
Bases:
LLMEvaluator
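LLM evaluators score task output with a judge model. A hedged sketch, assuming the judge is supplied via a model argument and that phoenix.evals.OpenAIModel can serve as that model (both are assumptions; the signatures above list only *args and **kwargs):

```python
from phoenix.evals import OpenAIModel
from phoenix.experiments.evaluators import HelpfulnessEvaluator

# The model keyword and the OpenAIModel constructor argument are assumptions.
helpfulness = HelpfulnessEvaluator(model=OpenAIModel(model="gpt-4o-mini"))
```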
utils#
- create_evaluator(kind=AnnotatorKind.CODE, name=None, scorer=None)#
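A hedged sketch of create_evaluator used as a decorator to attach a display name to a custom evaluator (the decorator usage and import path are assumptions based on the signature above):

```python
from phoenix.experiments.evaluators import create_evaluator

# Assumption: with scorer=None, create_evaluator returns a decorator that
# wraps the function as a named code evaluator.
@create_evaluator(name="nonempty_output")
def nonempty_output(output) -> bool:
    return bool(str(output).strip())
```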