experiments#
phoenix.experiments#
- evaluate_experiment(experiment, evaluators, *, dry_run=False, print_summary=True, rate_limit_errors=None)#
- run_experiment(dataset, task, evaluators=None, *, experiment_name=None, experiment_description=None, experiment_metadata=None, rate_limit_errors=None, dry_run=False, print_summary=True)#
Runs an experiment using a given dataset of examples.
An experiment is a user-defined task that runs on each example in a dataset. The results from each experiment can be evaluated using any number of evaluators to measure the behavior of the task. The experiment and evaluation results are stored in the Phoenix database for comparison and analysis.
A task is either a synchronous or asynchronous function that returns a JSON serializable output. If the task is a function of one argument then that argument will be bound to the input field of the dataset example. Alternatively, the task can be a function of any combination of specific argument names that will be bound to special values:
input: The input field of the dataset example
expected: The expected or reference output of the dataset example
reference: An alias for expected
metadata: Metadata associated with the dataset example
example: The dataset Example object with all associated fields
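For example, a task may accept any combination of these names. A minimal sketch (the return payload is illustrative):
def task(input, metadata):
    # `input` is bound to the example's input field, `metadata` to its metadata.
    # The return value must be JSON serializable.
    return {"answer": str(input), "tags": metadata.get("tags", [])}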
An evaluator is either a synchronous or asynchronous function that returns either a boolean or numeric “score”. If the evaluator is a function of one argument then that argument will be bound to the output of the task. Alternatively, the evaluator can be a function of any combination of specific argument names that will be bound to special values:
input: The input field of the dataset example
output: The output of the task
expected: The expected or reference output of the dataset example
reference: An alias for expected
metadata: Metadata associated with the dataset example
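For example, an evaluator that compares the task output to the expected output can be sketched as follows (the exact-match logic is illustrative; dataset and task are assumed to be defined):
def exact_match(output, expected):
    # Returns a boolean score: True when the task output equals the expected output.
    return output == expected

run_experiment(dataset, task, evaluators=[exact_match])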
Phoenix also provides pre-built evaluators in the phoenix.experiments.evaluators module.
- Parameters:
dataset (Dataset) – The dataset on which to run the experiment.
task (ExperimentTask) – The task to run on each example in the dataset.
evaluators (Optional[Evaluators]) – A single evaluator or sequence of evaluators used to evaluate the results of the experiment. Defaults to None.
experiment_name (Optional[str]) – The name of the experiment. Defaults to None.
experiment_description (Optional[str]) – A description of the experiment. Defaults to None.
experiment_metadata (Optional[Mapping[str, Any]]) – Metadata to associate with the experiment. Defaults to None.
rate_limit_errors (Optional[BaseException | Sequence[BaseException]]) – An exception or sequence of exceptions to adaptively throttle on. Defaults to None.
dry_run (bool | int) – Run the experiment in dry-run mode. When set, experiment results will not be recorded in Phoenix. If True, the experiment will run on a single random dataset example. If an integer, the experiment will run on a random sample of the dataset examples of the given size. Defaults to False.
print_summary (bool) – Whether to print a summary of the experiment and evaluation results. Defaults to True.
- Returns:
- The results of the experiment and evaluation. Additional evaluations can be added to the experiment using the evaluate_experiment function.
- Return type:
RanExperiment
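An end-to-end sketch, assuming dataset, task, my_evaluator, and another_evaluator are defined as described above:
from phoenix.experiments import evaluate_experiment, run_experiment

experiment = run_experiment(
    dataset,
    task,
    evaluators=[my_evaluator],
    experiment_name="baseline",
    dry_run=3,  # sample 3 examples; nothing is recorded in Phoenix
)
# Attach additional evaluations to the returned RanExperiment later.
experiment = evaluate_experiment(experiment, evaluators=[another_evaluator])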
phoenix.experiments.evaluators#
code_evaluators#
- class ContainsAllKeywords(*args, **kwargs)#
Bases:
CodeEvaluator
An evaluator that checks if all of the keywords are present in the output of an experiment run.
- Parameters:
keywords (List[str]) – The keywords to search for in the output.
name (str, optional) – An optional name for the evaluator. Defaults to “ContainsAll(<keywords>)”.
Example
from phoenix.experiments import run_experiment
from phoenix.experiments.evaluators import ContainsAllKeywords

run_experiment(dataset, task, evaluators=[ContainsAllKeywords(["foo", "bar"])])
- class ContainsAnyKeyword(*args, **kwargs)#
Bases:
CodeEvaluator
An evaluator that checks if any of the keywords are present in the output of an experiment run.
- Parameters:
keywords (List[str]) – The keywords to search for in the output.
name (str, optional) – An optional name for the evaluator. Defaults to “ContainsAny(<keywords>)”.
Example
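A usage sketch, analogous to the ContainsAllKeywords example above (dataset and task are assumed to be defined):
from phoenix.experiments import run_experiment
from phoenix.experiments.evaluators import ContainsAnyKeyword

run_experiment(dataset, task, evaluators=[ContainsAnyKeyword(["foo", "bar"])])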
- class ContainsKeyword(*args, **kwargs)#
Bases:
CodeEvaluator
An evaluator that checks if a keyword is present in the output of an experiment run.
- Parameters:
keyword (str) – The keyword to search for in the output.
name (str, optional) – An optional name for the evaluator. Defaults to “Contains(<keyword>)”.
Example
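A usage sketch, following the same pattern (dataset and task are assumed to be defined):
from phoenix.experiments import run_experiment
from phoenix.experiments.evaluators import ContainsKeyword

run_experiment(dataset, task, evaluators=[ContainsKeyword("foo")])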
- class JSONParsable(*args, **kwargs)#
Bases:
CodeEvaluator
An evaluator that checks if the output of an experiment run is a JSON-parsable string.
Example
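A usage sketch (dataset and task are assumed to be defined; the evaluator takes no required arguments):
from phoenix.experiments import run_experiment
from phoenix.experiments.evaluators import JSONParsable

run_experiment(dataset, task, evaluators=[JSONParsable()])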
- class MatchesRegex(*args, **kwargs)#
Bases:
CodeEvaluator
An experiment evaluator that checks if the output of an experiment run matches a regex pattern.
- Parameters:
pattern (Union[str, re.Pattern[str]]) – The regex pattern to match the output against.
name (str, optional) – An optional name for the evaluator. Defaults to “matches_({pattern})”.
Example
from phoenix.experiments import run_experiment
from phoenix.experiments.evaluators import MatchesRegex

phone_number_evaluator = MatchesRegex(r"\d{3}-\d{3}-\d{4}", name="valid-phone-number")
run_experiment(dataset, task, evaluators=[phone_number_evaluator])
llm_evaluators#
- class CoherenceEvaluator(*args, **kwargs)#
Bases:
LLMCriteriaEvaluator
An experiment evaluator that uses an LLM to evaluate whether the text is coherent.
- Parameters:
model – The LLM model wrapper to use for evaluation. Compatible models can be imported from the phoenix.evals module.
name (str, optional) – The name of the evaluator, defaults to “Coherence”.
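A usage sketch, assuming an OpenAI-backed model wrapper from phoenix.evals (the wrapper's constructor argument and model name are illustrative; dataset and task are assumed to be defined):
from phoenix.evals import OpenAIModel
from phoenix.experiments import run_experiment
from phoenix.experiments.evaluators import CoherenceEvaluator

coherence = CoherenceEvaluator(model=OpenAIModel(model="gpt-4o-mini"))
run_experiment(dataset, task, evaluators=[coherence])

ConcisenessEvaluator and HelpfulnessEvaluator below are constructed the same way.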
- class ConcisenessEvaluator(*args, **kwargs)#
Bases:
LLMCriteriaEvaluator
An experiment evaluator that uses an LLM to evaluate whether the text is concise.
- Parameters:
model – The LLM model wrapper to use for evaluation. Compatible models can be imported from the phoenix.evals module.
name (str, optional) – The name of the evaluator, defaults to “Conciseness”.
- class HelpfulnessEvaluator(*args, **kwargs)#
Bases:
LLMCriteriaEvaluator
An experiment evaluator that uses an LLM to evaluate whether the text is helpful.
- Parameters:
model – The LLM model wrapper to use for evaluation. Compatible models can be imported from the phoenix.evals module.
name (str, optional) – The name of the evaluator, defaults to “Helpfulness”.
- class LLMCriteriaEvaluator(*args, **kwargs)#
Bases:
LLMEvaluator
An experiment evaluator that uses an LLM to evaluate whether the text meets a custom criteria.
This evaluator uses the chain-of-thought technique to perform a binary evaluation of text based on a custom criteria and description. When used as an experiment evaluator, LLMCriteriaEvaluator will return a score of 1.0 if the text meets the criteria and a score of 0.0 if not. The explanation produced by the chain-of-thought technique will be included in the experiment evaluation as well.
- Example criteria and descriptions:
“thoughtfulness” - “shows careful consideration and fair judgement”
“clarity” - “is easy to understand and follow”
“professionalism” - “is respectful and appropriate for a formal setting”
- Parameters:
model – The LLM model wrapper to use for evaluation. Compatible models can be imported from the phoenix.evals module.
criteria – The criteria to evaluate the text against; it should read naturally as a noun in a sentence (e.g., “thoughtfulness”).
description (str) – A description of the criteria, used to clarify instructions to the LLM. The description should complete this sentence: “{criteria} means the text {description}”.
name (str) – The name of the evaluator
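A sketch using one of the criteria listed above, with model constructed as in the CoherenceEvaluator example (dataset and task are assumed to be defined):
from phoenix.experiments.evaluators import LLMCriteriaEvaluator

thoughtfulness = LLMCriteriaEvaluator(
    model=model,
    criteria="thoughtfulness",
    description="shows careful consideration and fair judgement",
    name="Thoughtfulness",
)
run_experiment(dataset, task, evaluators=[thoughtfulness])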
- class RelevanceEvaluator(*args, **kwargs)#
Bases:
LLMEvaluator
An experiment evaluator that uses an LLM to evaluate whether a response is relevant to a query.
This evaluator uses the chain-of-thought technique to perform a binary evaluation of whether the output “response” of an experiment is relevant to its input “query”. When used as an experiment evaluator, RelevanceEvaluator will return a score of 1.0 if the response is relevant to the query and a score of 0.0 if not. The explanation produced by the chain-of-thought technique will be included in the experiment evaluation as well.
Optionally, you can provide custom functions to extract the query from the dataset example’s input and the response from the experiment task’s output. By default, the evaluator uses the string representation of the example input as the query and the string representation of the task output as the response.
- Parameters:
model – The LLM model wrapper to use for evaluation. Compatible models can be imported from the phoenix.evals module.
get_query (callable, optional) – A function that extracts the query from the input of the experiment task. The function should take the input and metadata of the dataset example and return a string. By default, the function will return the string representation of the input.
get_response (callable, optional) – A function that extracts the response from the output of the experiment task. The function should take the output and metadata of the experiment task and return a string. By default, the function will return the string representation of the output.
name (str, optional) – The name of the evaluator. Defaults to “Relevance”.
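A sketch with custom extractors, assuming the example input is a dict with a "question" key and the task output is a dict with an "answer" key (both keys are assumptions), and model is an LLM wrapper from phoenix.evals as above:
from phoenix.experiments.evaluators import RelevanceEvaluator

relevance = RelevanceEvaluator(
    model=model,
    get_query=lambda input, metadata: input["question"],
    get_response=lambda output, metadata: output["answer"],
)
run_experiment(dataset, task, evaluators=[relevance])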
utils#
- create_evaluator(kind=AnnotatorKind.CODE, name=None, scorer=None)#
A decorator that configures a sync or async function to be used as an experiment evaluator.
If the evaluator is a function of one argument then that argument will be bound to the output of an experiment task. Alternatively, the evaluator can be a function of any combination of specific argument names that will be bound to special values:
input: The input field of the dataset example
output: The output of an experiment task
expected: The expected or reference output of the dataset example
reference: An alias for expected
metadata: Metadata associated with the dataset example
- Parameters:
kind (str | AnnotatorKind) – Broadly indicates how the evaluator scores an experiment run. Valid kinds are: “CODE”, “LLM”. Defaults to “CODE”.
name (str, optional) – The name of the evaluator. If not provided, the name of the function will be used.
scorer (callable, optional) – An optional function that converts the output of the wrapped function into an EvaluationResult. This allows configuring the evaluation payload by setting a label, score and explanation. By default, numeric outputs will be recorded as scores, boolean outputs will be recorded as scores and labels, and string outputs will be recorded as labels. If the output is a 2-tuple, the first item will be recorded as the score and the second item will be recorded as the explanation.
Examples
Configuring an evaluator that returns a boolean
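A minimal sketch (the import path follows the module layout above; boolean outputs are recorded as both a score and a label):
from phoenix.experiments.evaluators import create_evaluator

@create_evaluator(kind="CODE", name="exact-match")
def exact_match(output, expected):
    # True when the task output equals the expected output.
    return output == expected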
Configuring an evaluator that returns a label
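Continuing with create_evaluator imported above, a minimal sketch (string outputs are recorded as labels; the sentiment logic is illustrative):
@create_evaluator(name="sentiment")
def sentiment(output):
    return "positive" if "thank" in str(output).lower() else "neutral"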
Configuring an evaluator that returns a score and explanation
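A minimal sketch (a 2-tuple is recorded as score and explanation; the length heuristic is illustrative):
@create_evaluator(name="length-score")
def length_score(output):
    n = len(str(output))
    return min(n / 100, 1.0), f"output length is {n} characters"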