experiments
- evaluate_experiment(experiment: Experiment, evaluators: Evaluator | Callable[[...], EvaluationResult | bool | int | float | str | Tuple[bool | int | float | None, str | None, str | None]] | Callable[[...], Awaitable[EvaluationResult | bool | int | float | str | Tuple[bool | int | float | None, str | None, str | None]]] | Sequence[Evaluator | Callable[[...], EvaluationResult | bool | int | float | str | Tuple[bool | int | float | None, str | None, str | None]] | Callable[[...], Awaitable[EvaluationResult | bool | int | float | str | Tuple[bool | int | float | None, str | None, str | None]]]] | Mapping[str, Evaluator | Callable[[...], EvaluationResult | bool | int | float | str | Tuple[bool | int | float | None, str | None, str | None]] | Callable[[...], Awaitable[EvaluationResult | bool | int | float | str | Tuple[bool | int | float | None, str | None, str | None]]]], *, dry_run: bool | int = False, print_summary: bool = True, rate_limit_errors: Type[BaseException] | Sequence[Type[BaseException]] | None = None) → RanExperiment
- run_experiment(dataset: Dataset, task: Callable[[Example], Dict[str, Any] | List[Any] | str | int | float | bool | None] | Callable[[Example], Awaitable[Dict[str, Any] | List[Any] | str | int | float | bool | None]], evaluators: Evaluator | Callable[[...], EvaluationResult | bool | int | float | str | Tuple[bool | int | float | None, str | None, str | None]] | Callable[[...], Awaitable[EvaluationResult | bool | int | float | str | Tuple[bool | int | float | None, str | None, str | None]]] | Sequence[Evaluator | Callable[[...], EvaluationResult | bool | int | float | str | Tuple[bool | int | float | None, str | None, str | None]] | Callable[[...], Awaitable[EvaluationResult | bool | int | float | str | Tuple[bool | int | float | None, str | None, str | None]]]] | Mapping[str, Evaluator | Callable[[...], EvaluationResult | bool | int | float | str | Tuple[bool | int | float | None, str | None, str | None]] | Callable[[...], Awaitable[EvaluationResult | bool | int | float | str | Tuple[bool | int | float | None, str | None, str | None]]]] | None = None, *, experiment_name: str | None = None, experiment_description: str | None = None, experiment_metadata: Mapping[str, Any] | None = None, rate_limit_errors: Type[BaseException] | Sequence[Type[BaseException]] | None = None, dry_run: bool | int = False, print_summary: bool = True) → RanExperiment
Runs an experiment using a given dataset of examples.
An experiment is a user-defined task that runs on each example in a dataset. The results from each experiment can be evaluated using any number of evaluators to measure the behavior of the task. The experiment and evaluation results are stored in the Phoenix database for comparison and analysis.
A task is either a synchronous or asynchronous function that returns a JSON serializable output. If the task is a function of one argument then that argument will be bound to the input field of the dataset example. Alternatively, the task can be a function of any combination of specific argument names that will be bound to special values:
- input: The input field of the dataset example
- expected: The expected or reference output of the dataset example
- reference: An alias for expected
- metadata: Metadata associated with the dataset example
- example: The dataset Example object with all associated fields
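For illustration, here is a minimal sketch of both forms. The field names inside the example input and metadata (question, source) are illustrative, not part of the API; the parameter names input and metadata are what trigger the bindings described above:

```python
from typing import Any, Dict

# One-argument form: the single parameter is bound to the example's input field.
def answer_question(input: Dict[str, Any]) -> str:
    return f"The answer to {input['question']!r} is 42."

# Named-argument form: only the requested names are bound; the return value
# just needs to be JSON-serializable (dict, list, str, int, float, bool, or None).
def answer_with_source(input: Dict[str, Any], metadata: Dict[str, Any]) -> Dict[str, Any]:
    return {"answer": input["question"], "source": metadata.get("source")}
```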
An evaluator is either a synchronous or asynchronous function that returns a “score”. As the signatures above indicate, the score may be a boolean, a number, a string label, an EvaluationResult, or a tuple of a numeric score plus optional string fields. If the evaluator is a function of one argument then that argument will be bound to the output of the task. Alternatively, the evaluator can be a function of any combination of specific argument names that will be bound to special values:
- input: The input field of the dataset example
- output: The output of the task
- expected: The expected or reference output of the dataset example
- reference: An alias for expected
- metadata: Metadata associated with the dataset example
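As a sketch, an evaluator can be a plain function returning a bool or float, or it can build a richer result. The EvaluationResult fields used below (score, label, explanation) and the expected-output field name "answer" are assumptions for illustration:

```python
from typing import Any, Dict

from phoenix.experiments.types import EvaluationResult

# One-argument form: the single parameter is bound to the task output.
def is_concise(output: str) -> bool:
    return len(output) < 200

# Named-argument form: compare the task output against the expected output.
def exact_match(output: str, expected: Dict[str, Any]) -> EvaluationResult:
    matched = output == expected.get("answer")
    return EvaluationResult(
        score=float(matched),                      # numeric score
        label="match" if matched else "no_match",  # optional categorical label
        explanation=None,                          # optional free-text explanation
    )
```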
Phoenix also provides pre-built evaluators in the phoenix.experiments.evaluators module.
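For instance, pre-built and custom evaluators can be mixed in a single list. The constructor arguments shown here (a keyword string and a regex pattern) are assumptions about those helpers rather than verified signatures:

```python
from phoenix.experiments.evaluators import ContainsKeyword, MatchesRegex

# Hypothetical configuration: require a keyword and an ISO-style date in the output.
evaluators = [
    ContainsKeyword("refund"),
    MatchesRegex(r"\d{4}-\d{2}-\d{2}"),
]
```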
- Parameters:
dataset (Dataset) – The dataset on which to run the experiment.
task (ExperimentTask) – The task to run on each example in the dataset.
evaluators (Optional[Evaluators]) – A single evaluator or sequence of evaluators used to evaluate the results of the experiment. Defaults to None.
experiment_name (Optional[str]) – The name of the experiment. Defaults to None.
experiment_description (Optional[str]) – A description of the experiment. Defaults to None.
experiment_metadata (Optional[Mapping[str, Any]]) – Metadata to associate with the experiment. Defaults to None.
rate_limit_errors (Optional[Type[BaseException] | Sequence[Type[BaseException]]]) – An exception type, or sequence of exception types, to adaptively throttle on. Defaults to None.
dry_run (bool | int) – Run the experiment in dry-run mode. When set, experiment results will not be recorded in Phoenix. If True, the experiment will run on a single random dataset example. If an integer, the experiment will run on a random sample of the dataset examples of the given size. Defaults to False.
print_summary (bool) – Whether to print a summary of the experiment and evaluation results. Defaults to True.
- Returns:
- The results of the experiment and evaluation. Additional evaluations can be added to the experiment using the evaluate_experiment function.
- Return type:
- RanExperiment
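Putting the pieces together, a minimal end-to-end sketch might look like the following. It assumes a Dataset object named dataset has already been obtained (for example via the Phoenix client), reuses the answer_question, exact_match, and is_concise sketches above, and assumes run_experiment and evaluate_experiment are importable from phoenix.experiments (they are defined in experiments.functions per the index below):

```python
from phoenix.experiments import evaluate_experiment, run_experiment

# `dataset` is a phoenix Dataset obtained elsewhere; the task and evaluators
# are the sketches defined earlier on this page.
experiment = run_experiment(
    dataset,
    answer_question,
    evaluators=[exact_match],
    experiment_name="qa-baseline",
    experiment_description="Baseline QA task scored by exact match",
    experiment_metadata={"model": "example-model-v1"},
)

# Additional evaluations can be attached to the returned RanExperiment later.
experiment = evaluate_experiment(experiment, evaluators=[is_concise])
```

Passing dry_run=True (or an integer sample size) to either call exercises the same code path on a random sample of examples without recording anything in Phoenix.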
- experiments.evaluators
CoherenceEvaluator
ConcisenessEvaluator
ContainsAllKeywords
ContainsAnyKeyword
ContainsKeyword
HelpfulnessEvaluator
JSONParsable
LLMCriteriaEvaluator
MatchesRegex
RelevanceEvaluator
create_evaluator()
- experiments.evaluators.base
- experiments.evaluators.code_evaluators
- experiments.evaluators.llm_evaluators
- experiments.evaluators.utils
- experiments.functions
- experiments.tracing
- experiments.types
AnnotatorKind
Dataset
EvaluationParameters
EvaluationResult
EvaluationSummary
Example
Experiment
ExperimentEvaluationRun
ExperimentEvaluationRun.annotator_kind
ExperimentEvaluationRun.end_time
ExperimentEvaluationRun.error
ExperimentEvaluationRun.experiment_run_id
ExperimentEvaluationRun.from_dict()
ExperimentEvaluationRun.id
ExperimentEvaluationRun.name
ExperimentEvaluationRun.result
ExperimentEvaluationRun.start_time
ExperimentEvaluationRun.trace_id
ExperimentParameters
ExperimentRun
ExperimentRunOutput
RanExperiment
TaskSummary
TestCase
getrandbits()
- experiments.utils