experiments#

phoenix.experiments#

evaluate_experiment(experiment, evaluators, *, dry_run=False, print_summary=True, rate_limit_errors=None)#

Runs a set of evaluators against the runs of a previously run experiment. The evaluation results are recorded in Phoenix for comparison and analysis unless dry_run is set.

run_experiment(dataset, task, evaluators=None, *, experiment_name=None, experiment_description=None, experiment_metadata=None, rate_limit_errors=None, dry_run=False, print_summary=True)#

Runs an experiment using a given dataset of examples.

An experiment is a user-defined task that runs on each example in a dataset. The results from each experiment can be evaluated using any number of evaluators to measure the behavior of the task. The experiment and evaluation results are stored in the Phoenix database for comparison and analysis.

A task is either a synchronous or asynchronous function that returns a JSON serializable output. If the task is a function of one argument then that argument will be bound to the input field of the dataset example. Alternatively, the task can be a function of any combination of specific argument names that will be bound to special values:

  • input: The input field of the dataset example

  • expected: The expected or reference output of the dataset example

  • reference: An alias for expected

  • metadata: Metadata associated with the dataset example

  • example: The dataset Example object with all associated fields
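
For example, a task can accept any combination of these argument names. The following is a minimal sketch; the "question" and "prompt_template" fields are illustrative, not part of the Phoenix API:

def my_task(input, metadata):
    # "prompt_template" is assumed to be a metadata field of this dataset.
    template = metadata.get("prompt_template", "{question}")
    return template.format(question=input["question"])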

An evaluator is either a synchronous or asynchronous function that returns either a boolean or numeric “score”. If the evaluator is a function of one argument then that argument will be bound to the output of the task. Alternatively, the evaluator can be a function of any combination of specific argument names that will be bound to special values:

  • input: The input field of the dataset example

  • output: The output of the task

  • expected: The expected or reference output of the dataset example

  • reference: An alias for expected

  • metadata: Metadata associated with the dataset example

Phoenix also provides pre-built evaluators in the phoenix.experiments.evaluators module.
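
For example, a custom evaluator function can be combined with a pre-built evaluator when running an experiment. This is a sketch; dataset and my_task are assumed to be defined elsewhere:

from phoenix.experiments import run_experiment
from phoenix.experiments.evaluators import ContainsAnyKeyword

# A custom evaluator that scores the task output against the expected output.
def exact_match(output, expected):
    return output == expected

run_experiment(
    dataset,
    my_task,
    evaluators=[exact_match, ContainsAnyKeyword(["foo", "bar"])],
)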

Parameters:
  • dataset (Dataset) – The dataset on which to run the experiment.

  • task (ExperimentTask) – The task to run on each example in the dataset.

  • evaluators (Optional[Evaluators]) – A single evaluator or sequence of evaluators used to evaluate the results of the experiment. Defaults to None.

  • experiment_name (Optional[str]) – The name of the experiment. Defaults to None.

  • experiment_description (Optional[str]) – A description of the experiment. Defaults to None.

  • experiment_metadata (Optional[Mapping[str, Any]]) – Metadata to associate with the experiment. Defaults to None.

  • rate_limit_errors (Optional[BaseException | Sequence[BaseException]]) – An exception or sequence of exceptions to adaptively throttle on. Defaults to None.

  • dry_run (bool | int) – Run the experiment in dry-run mode. When set, experiment results will not be recorded in Phoenix. If True, the experiment will run on a single random dataset example. If an integer, the experiment will run on a random sample of the dataset examples of the given size. Defaults to False.

  • print_summary (bool) – Whether to print a summary of the experiment and evaluation results. Defaults to True.

Returns:

The results of the experiment and evaluation. Additional evaluations can be added to the experiment using the evaluate_experiment function.

Return type:

RanExperiment
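
For example, further evaluators can be applied to the returned RanExperiment after the fact. This is a sketch; dataset, my_task, and the evaluator body are illustrative:

from phoenix.experiments import evaluate_experiment, run_experiment

experiment = run_experiment(dataset, my_task)

# Attach an additional evaluation to the completed experiment.
def output_is_nonempty(output):
    return bool(str(output).strip())

evaluate_experiment(experiment, evaluators=[output_is_nonempty])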

phoenix.experiments.evaluators#

code_evaluators#

class ContainsAllKeywords(*args, **kwargs)#

Bases: CodeEvaluator

An evaluator that checks if all of the keywords are present in the output of an experiment run.

Parameters:
  • keywords (List[str]) – The keywords to search for in the output.

  • name (str, optional) – An optional name for the evaluator. Defaults to “ContainsAll(<keywords>)”.

Example

from phoenix.experiments import run_experiment
from phoenix.experiments.evaluators import ContainsAllKeywords

run_experiment(dataset, task, evaluators=[ContainsAllKeywords(["foo", "bar"])])

class ContainsAnyKeyword(*args, **kwargs)#

Bases: CodeEvaluator

An evaluator that checks if any of the keywords are present in the output of an experiment run.

Parameters:
  • keywords (List[str]) – The keywords to search for in the output.

  • name (str, optional) – An optional name for the evaluator. Defaults to “ContainsAny(<keywords>)”.

Example
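
The usage mirrors the ContainsAllKeywords example above (dataset and task are assumed to be defined):

from phoenix.experiments import run_experiment
from phoenix.experiments.evaluators import ContainsAnyKeyword

run_experiment(dataset, task, evaluators=[ContainsAnyKeyword(["foo", "bar"])])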

class ContainsKeyword(*args, **kwargs)#

Bases: CodeEvaluator

An evaluator that checks if a keyword is present in the output of an experiment run.

Parameters:
  • keyword (str) – The keyword to search for in the output.

  • name (str, optional) – An optional name for the evaluator. Defaults to “Contains(<keyword>)”.

Example
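
A usage sketch analogous to the keyword evaluators above (dataset and task are assumed to be defined):

from phoenix.experiments import run_experiment
from phoenix.experiments.evaluators import ContainsKeyword

run_experiment(dataset, task, evaluators=[ContainsKeyword("foo")])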

class JSONParsable(*args, **kwargs)#

Bases: CodeEvaluator

An evaluator that checks if the output of an experiment run is a JSON-parsable string.

Example
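
A usage sketch (dataset and task are assumed to be defined):

from phoenix.experiments import run_experiment
from phoenix.experiments.evaluators import JSONParsable

run_experiment(dataset, task, evaluators=[JSONParsable()])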

class MatchesRegex(*args, **kwargs)#

Bases: CodeEvaluator

An experiment evaluator that checks if the output of an experiment run matches a regex pattern.

Parameters:
  • pattern (Union[str, re.Pattern[str]]) – The regex pattern to match the output against.

  • name (str, optional) – An optional name for the evaluator. Defaults to “matches_({pattern})”.

Example

from phoenix.experiments import run_experiment
from phoenix.experiments.evaluators import MatchesRegex

phone_number_evaluator = MatchesRegex(r"\d{3}-\d{3}-\d{4}", name="valid-phone-number")
run_experiment(dataset, task, evaluators=[phone_number_evaluator])

llm_evaluators#

class CoherenceEvaluator(*args, **kwargs)#

Bases: LLMCriteriaEvaluator

An experiment evaluator that uses an LLM to evaluate whether the text is coherent.

Parameters:
  • model – The LLM model wrapper to use for evaluation. Compatible models can be imported from the phoenix.evals module.

  • name (str, optional) – The name of the evaluator, defaults to “Coherence”.
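
A usage sketch, assuming an LLM wrapper such as OpenAIModel from phoenix.evals is available (dataset and task are assumed to be defined; ConcisenessEvaluator and HelpfulnessEvaluator below are used the same way):

from phoenix.evals import OpenAIModel
from phoenix.experiments import run_experiment
from phoenix.experiments.evaluators import CoherenceEvaluator

run_experiment(dataset, task, evaluators=[CoherenceEvaluator(model=OpenAIModel())])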

class ConcisenessEvaluator(*args, **kwargs)#

Bases: LLMCriteriaEvaluator

An experiment evaluator that uses an LLM to evaluate whether the text is concise.

Parameters:
  • model – The LLM model wrapper to use for evaluation. Compatible models can be imported from the phoenix.evals module.

  • name (str, optional) – The name of the evaluator, defaults to “Conciseness”.

class HelpfulnessEvaluator(*args, **kwargs)#

Bases: LLMCriteriaEvaluator

An experiment evaluator that uses an LLM to evaluate whether the text is helpful.

Parameters:
  • model – The LLM model wrapper to use for evaluation. Compatible models can be imported from the phoenix.evals module.

  • name (str, optional) – The name of the evaluator, defaults to “Helpfulness”.

class LLMCriteriaEvaluator(*args, **kwargs)#

Bases: LLMEvaluator

An experiment evaluator that uses an LLM to evaluate whether the text meets a custom criteria.

This evaluator uses the chain-of-thought technique to perform a binary evaluation of text based on a custom criteria and description. When used as an experiment evaluator, LLMCriteriaEvaluator will return a score of 1.0 if the text meets the criteria and a score of 0.0 if not. The explanation produced by the chain-of-thought technique will be included in the experiment evaluation as well.

Example criteria and descriptions:
  • “thoughtfulness” - “shows careful consideration and fair judgement”

  • “clarity” - “is easy to understand and follow”

  • “professionalism” - “is respectful and appropriate for a formal setting”

Parameters:
  • model – The LLM model wrapper to use for evaluation. Compatible models can be imported from the phoenix.evals module.

  • criteria – The criteria to evaluate the text against. The criteria should be usable as a noun in a sentence.

  • description (str) – A description of the criteria, used to clarify instructions to the LLM. The description should complete this sentence: “{criteria} means the text {description}”.

  • name (str) – The name of the evaluator.
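
A custom-criteria sketch based on the parameters above, assuming an LLM wrapper such as OpenAIModel from phoenix.evals:

from phoenix.evals import OpenAIModel
from phoenix.experiments.evaluators import LLMCriteriaEvaluator

thoughtfulness_evaluator = LLMCriteriaEvaluator(
    model=OpenAIModel(),
    criteria="thoughtfulness",
    description="shows careful consideration and fair judgement",
    name="Thoughtfulness",
)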

class RelevanceEvaluator(*args, **kwargs)#

Bases: LLMEvaluator

An experiment evaluator that uses an LLM to evaluate whether a response is relevant to a query.

This evaluator uses the chain-of-thought technique to perform a binary evaluation of whether the output “response” of an experiment is relevant to its input “query”. When used as an experiment evaluator, RelevanceEvaluator will return a score of 1.0 if the response is relevant to the query and a score of 0.0 if not. The explanation produced by the chain-of-thought technique will be included in the experiment evaluation as well.

Optionally, you can provide custom functions to extract the query and response from the input and output of the experiment task. By default, the evaluator will use the dataset example as the input and the output of the experiment task as the response.

Parameters:
  • model – The LLM model wrapper to use for evaluation. Compatible models can be imported from the phoenix.evals module.

  • get_query (callable, optional) – A function that extracts the query from the input of the experiment task. The function should take the input and metadata of the dataset example and return a string. By default, the function will return the string representation of the input.

  • get_response (callable, optional) – A function that extracts the response from the output of the experiment task. The function should take the output and metadata of the experiment task and return a string. By default, the function will return the string representation of the output.

  • name (str, optional) – The name of the evaluator. Defaults to “Relevance”.
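
A sketch of supplying custom extractors, based on the parameter descriptions above; the positional (value, metadata) call signature and the "question" field are assumptions:

from phoenix.evals import OpenAIModel
from phoenix.experiments.evaluators import RelevanceEvaluator

relevance_evaluator = RelevanceEvaluator(
    model=OpenAIModel(),
    get_query=lambda input, metadata: str(input.get("question", input)),
    get_response=lambda output, metadata: str(output),
)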

utils#

create_evaluator(kind=AnnotatorKind.CODE, name=None, scorer=None)#

A decorator that configures a sync or async function to be used as an experiment evaluator.

If the evaluator is a function of one argument then that argument will be bound to the output of an experiment task. Alternatively, the evaluator can be a function of any combination of specific argument names that will be bound to special values:

  • input: The input field of the dataset example

  • output: The output of an experiment task

  • expected: The expected or reference output of the dataset example

  • reference: An alias for expected

  • metadata: Metadata associated with the dataset example

Parameters:
  • kind (str | AnnotatorKind) – Broadly indicates how the evaluator scores an experiment run. Valid kinds are: “CODE”, “LLM”. Defaults to “CODE”.

  • name (str, optional) – The name of the evaluator. If not provided, the name of the function will be used.

  • scorer (callable, optional) – An optional function that converts the output of the wrapped function into an EvaluationResult. This allows configuring the evaluation payload by setting a label, score and explanation. By default, numeric outputs will be recorded as scores, boolean outputs will be recorded as scores and labels, and string outputs will be recorded as labels. If the output is a 2-tuple, the first item will be recorded as the score and the second item will be recorded as the explanation.

Examples

Configuring an evaluator that returns a boolean
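
A minimal sketch, assuming create_evaluator is imported from phoenix.experiments.evaluators (the evaluator name and logic are illustrative):

from phoenix.experiments.evaluators import create_evaluator

# A boolean return value is recorded as both a score and a label.
@create_evaluator(kind="CODE", name="exact-match")
def exact_match(output, expected):
    return output == expected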

Configuring an evaluator that returns a label
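
A sketch of a label-producing evaluator; string outputs are recorded as labels, and the classification logic here is illustrative:

from phoenix.experiments.evaluators import create_evaluator

@create_evaluator(name="verbosity-label")
def verbosity_label(output):
    return "verbose" if len(str(output)) > 500 else "terse"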

Configuring an evaluator that returns a score and explanation
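
A sketch of an evaluator returning a (score, explanation) 2-tuple; the scoring logic is illustrative:

from phoenix.experiments.evaluators import create_evaluator

@create_evaluator(name="length-score")
def length_score(output):
    length = len(str(output))
    score = min(length / 100, 1.0)
    return score, f"output length is {length} characters"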