Experiments
- class Experiments(client, *, _guard=None)
Bases: object
Provides methods for running experiments and evaluations.
An experiment is a user-defined task that runs on each example in a dataset. The results from each experiment can be evaluated using any number of evaluators to measure the behavior of the task. The experiment and evaluation results are stored in the Phoenix database for comparison and analysis.
A task is either a synchronous or asynchronous function that returns a JSON serializable output. If the task is a function of one argument then that argument will be bound to the input field of the dataset example. Alternatively, the task can be a function of any combination of specific argument names that will be bound to special values:
input: The input field of the dataset example
expected: The expected or reference output of the dataset example
reference: An alias for expected
metadata: Metadata associated with the dataset example
example: The dataset Example object with all associated fields
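The name-based binding described above can be pictured with a short sketch. It is not Phoenix's implementation: `bind_task_arguments` and the plain-dict example record are hypothetical stand-ins, and the choice to bind a lone parameter by name when its name is one of the special ones is an assumption of this sketch.

```python
import inspect
from typing import Any, Callable, Dict


def bind_task_arguments(task: Callable[..., Any], example: Dict[str, Any]) -> Dict[str, Any]:
    """Build keyword arguments for `task` from a dataset example (illustrative only)."""
    # Values available for binding; `reference` is an alias for `expected`.
    available = {
        "input": example["input"],
        "expected": example["expected"],
        "reference": example["expected"],
        "metadata": example["metadata"],
        "example": example,
    }
    names = list(inspect.signature(task).parameters)
    if not names:
        raise ValueError("A task must accept at least one argument")
    if len(names) == 1:
        # Assumption: a lone parameter with a recognized name is bound by name;
        # any other lone parameter receives the example's input field.
        only = names[0]
        return {only: available.get(only, available["input"])}
    unknown = [name for name in names if name not in available]
    if unknown:
        raise ValueError(f"Cannot bind task parameters: {unknown}")
    return {name: available[name] for name in names}


# Hypothetical example record, shaped as a plain dict for this sketch.
example = {
    "input": {"name": "Ada"},
    "expected": {"text": "Hello Ada"},
    "metadata": {"context": "greeting"},
}


def greet(question):  # one argument: receives the input field
    return f"Hello {question['name']}"


def greet_with_context(input, metadata):  # several arguments: bound by name
    return f"{metadata['context']}: Hello {input['name']}"


print(greet(**bind_task_arguments(greet, example)))
print(greet_with_context(**bind_task_arguments(greet_with_context, example)))
```

Rejecting unrecognized names up front, as the sketch does, surfaces a misspelled parameter before any example is processed.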
An evaluator is either a synchronous or asynchronous function that returns an evaluation result object, which can take any of the following forms:
an EvaluationResult dict with optional fields for score, label, explanation and metadata
a bool, which will be interpreted as a score of 0 or 1 plus a label of “True” or “False”
a float, which will be interpreted as a score
a str, which will be interpreted as a label
a 2-tuple of (float, str), which will be interpreted as (score, explanation)
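As a rough illustration of how the return forms listed above map onto score, label and explanation, consider the sketch below. `normalize_evaluation_result` is a hypothetical helper, not part of the Phoenix client, and accepting plain integers as scores is an assumption.

```python
from typing import Any, Dict


def normalize_evaluation_result(result: Any) -> Dict[str, Any]:
    """Map an evaluator's return value onto score/label/explanation fields."""
    # bool must be tested before numbers because bool is a subclass of int.
    if isinstance(result, bool):
        return {"score": float(result), "label": str(result)}
    if isinstance(result, (int, float)):  # assumption: ints are accepted as scores
        return {"score": float(result)}
    if isinstance(result, str):
        return {"label": result}
    if isinstance(result, tuple) and len(result) == 2:
        score, explanation = result
        return {"score": float(score), "explanation": str(explanation)}
    if isinstance(result, dict):
        allowed = {"score", "label", "explanation", "metadata"}
        return {key: value for key, value in result.items() if key in allowed}
    raise TypeError(f"Unsupported evaluation result: {result!r}")


for value in (True, 0.5, "pass", (0.9, "close match"), {"score": 1.0, "label": "ok"}):
    print(normalize_evaluation_result(value))
```

The order of the checks matters: testing for numbers first would silently turn `True` into a bare score with no label.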
If the evaluator is a function of one argument then that argument will be bound to the output of the task. Alternatively, the evaluator can be a function of any combination of specific argument names that will be bound to special values:
input: The input field of the dataset example
output: The output of the task
expected: The expected or reference output of the dataset example
reference: An alias for expected
metadata: Metadata associated with the dataset example
example: The dataset Example object with all associated fields
Example
Basic usage:
from phoenix.client import Client

client = Client()
dataset = client.datasets.get_dataset(dataset="my-dataset")

def my_task(input):
    return f"Hello {input['name']}"

experiment = client.experiments.run_experiment(
    dataset=dataset,
    task=my_task,
    experiment_name="greeting-experiment"
)
With evaluators:
def accuracy_evaluator(output, expected):
    return 1.0 if output == expected['text'] else 0.0

experiment = client.experiments.run_experiment(
    dataset=dataset,
    task=my_task,
    evaluators=[accuracy_evaluator],
    experiment_name="evaluated-experiment"
)
Using dynamic binding for tasks:
def my_task(input, metadata, expected):
    # Task can access multiple fields from the dataset example
    context = metadata.get("context", "")
    return f"Context: {context}, Input: {input}, Expected: {expected}"
Using dynamic binding for evaluators:
def my_evaluator(output, input, expected, metadata):
    # Evaluator can access task output and example fields
    # calculate_similarity stands in for a user-defined similarity function
    score = calculate_similarity(output, expected)
    return {"score": score, "label": "pass" if score > 0.8 else "fail"}
- create(*, dataset_id, dataset_version_id=None, experiment_name=None, experiment_description=None, experiment_metadata=None, splits=None, repetitions=1, timeout=DEFAULT_TIMEOUT_IN_SECONDS)
Create a new experiment without running it.
This method creates an experiment record in the Phoenix database but does not execute any tasks. Use resume_experiment to run tasks on the created experiment.
- Parameters:
dataset_id (str) – The ID of the dataset on which the experiment will be run.
dataset_version_id (Optional[str]) – The ID of the dataset version to use. If not provided, the latest version will be used. Defaults to None.
experiment_name (Optional[str]) – The name of the experiment. Defaults to None.
experiment_description (Optional[str]) – A description of the experiment. Defaults to None.
experiment_metadata (Optional[Mapping[str, Any]]) – Metadata to associate with the experiment. Defaults to None.
splits (Optional[Sequence[str]]) – List of dataset split identifiers (IDs or names) to filter by. Defaults to None.
repetitions (int) – The number of times the task will be run on each example. Defaults to 1.
timeout (Optional[int]) – The timeout for the request in seconds. Defaults to 60.
- Returns:
The newly created experiment.
- Return type:
Experiment
- Raises:
httpx.HTTPStatusError – If the API returns an error response.
Example:
from phoenix.client import Client

client = Client()
experiment = client.experiments.create(
    dataset_id="dataset_123",
    experiment_name="my-experiment",
    experiment_description="Testing my task",
    repetitions=3,
)
print(f"Created experiment with ID: {experiment['id']}")

# Later, run the experiment (my_task is a task function defined elsewhere)
client.experiments.resume_experiment(
    experiment_id=experiment["id"],
    task=my_task,
)
- delete(*, experiment_id, delete_project=False)
Delete an experiment by ID.
- Parameters:
experiment_id (str) – The ID of the experiment to delete.
delete_project (bool) – If True, also delete the project associated with the experiment. Defaults to False.
- Raises:
httpx.HTTPError – If the request fails.
ValueError – If the experiment is not found.
Example:
from phoenix.client import Client

client = Client()
client.experiments.delete(experiment_id="exp_123")
- evaluate_experiment(*, experiment, evaluators, dry_run=False, print_summary=True, timeout=DEFAULT_TIMEOUT_IN_SECONDS, rate_limit_errors=None, retries=3)
Run evaluators on a completed experiment.
An evaluator is a synchronous function that returns an evaluation result object, which can take any of the following forms:
an EvaluationResult dict with optional fields for score, label, explanation and metadata
a bool, which will be interpreted as a score of 0 or 1 plus a label of “True” or “False”
a float, which will be interpreted as a score
a str, which will be interpreted as a label
a 2-tuple of (float, str), which will be interpreted as (score, explanation)
- Parameters:
experiment (RanExperiment) – The experiment to evaluate, returned from run_experiment.
evaluators (ExperimentEvaluators) – A single evaluator or sequence of evaluators used to evaluate the results of the experiment. Evaluators can be provided as a dict mapping names to functions, or as a list of functions (names will be auto-generated).
dry_run (bool) – Run the evaluation in dry-run mode. When set, evaluation results will not be recorded in Phoenix. Defaults to False.
print_summary (bool) – Whether to print a summary of the evaluation results. Defaults to True.
timeout (Optional[int]) – The timeout for the evaluation execution in seconds. Defaults to 60.
rate_limit_errors (Optional[RateLimitErrors]) – An exception or sequence of exceptions to adaptively throttle on. Defaults to None.
retries (int) – The number of times to retry a task if it fails. Defaults to 3.
- Returns:
A dictionary containing the evaluation results, in the same format as run_experiment.
- Return type:
RanExperiment
- Raises:
ValueError – If no evaluators are provided or the experiment has no runs.
httpx.HTTPStatusError – If the API returns an error response.
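The evaluators argument accepts a single function, a list, or a dict of named functions. The sketch below shows one way such input could be normalized into a name-to-function mapping. `name_evaluators` is hypothetical, and deriving generated names from each function's `__name__` is an assumption, since the text only says that names are auto-generated.

```python
from collections.abc import Mapping
from typing import Any, Callable, Dict


def name_evaluators(evaluators: Any) -> Dict[str, Callable[..., Any]]:
    """Normalize a single evaluator, a list, or a dict into a name -> function dict."""
    if callable(evaluators):
        evaluators = [evaluators]  # a single evaluator becomes a one-item list
    if isinstance(evaluators, Mapping):
        named = dict(evaluators)  # explicit names are kept as given
    else:
        named = {}
        for fn in evaluators:
            # Assumption: the generated name is the function's __name__.
            name = getattr(fn, "__name__", type(fn).__name__)
            if name in named:
                raise ValueError(f"Duplicate evaluator name: {name}")
            named[name] = fn
    if not named:
        raise ValueError("At least one evaluator is required")
    return named


def exact_match(output, expected):
    return output == expected


def non_empty(output):
    return bool(output)


print(list(name_evaluators([exact_match, non_empty])))
print(list(name_evaluators({"accuracy": exact_match})))
```

Passing a dict gives stable, explicit evaluation names, which matters when evaluations are later matched by name.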
- get(*, experiment_id)
Get an experiment by ID.
- Parameters:
experiment_id (str) – The ID of the experiment to retrieve.
- Returns:
The experiment with the specified ID.
- Return type:
Experiment
- Raises:
httpx.HTTPError – If the request fails.
ValueError – If the experiment is not found.
Example:
from phoenix.client import Client

client = Client()
experiment = client.experiments.get(experiment_id="exp_123")
print(f"Example count: {experiment['example_count']}")
print(f"Successful runs: {experiment['successful_run_count']}")
- get_dataset_experiments_url(dataset_id)
- get_experiment(*, experiment_id)
Get a completed experiment by ID.
This method retrieves a completed experiment with all its task runs and evaluation runs, returning a RanExperiment object that can be used with evaluate_experiment to run additional evaluations.
- Parameters:
experiment_id (str) – The ID of the experiment to retrieve.
- Returns:
A RanExperiment object containing the experiment data, task runs, and evaluation runs.
- Return type:
RanExperiment
- Raises:
ValueError – If the experiment is not found.
httpx.HTTPStatusError – If the API returns an error response.
Example:
from phoenix.client import Client

client = Client()
experiment = client.experiments.get_experiment(experiment_id="123")
client.experiments.evaluate_experiment(
    experiment=experiment,
    evaluators=[
        correctness,  # an evaluator function defined elsewhere
    ],
    print_summary=True,
)
- get_experiment_url(dataset_id, experiment_id)
- list(*, dataset_id, timeout=DEFAULT_TIMEOUT_IN_SECONDS)
List all experiments for a dataset with automatic pagination handling.
This method automatically handles pagination behind the scenes and returns a simple list of experiments.
- Parameters:
dataset_id (str) – The ID of the dataset to list experiments for.
timeout – The request timeout in seconds for each paginated request. Defaults to 60.
- Returns:
A list of all experiments for the dataset.
- Return type:
list[Experiment]
- Raises:
httpx.HTTPError – If the request fails.
Example:
from phoenix.client import Client

client = Client()
experiments = client.experiments.list(dataset_id="dataset_123")
for experiment in experiments:
    print(f"Experiment: {experiment['id']}, Runs: {experiment['successful_run_count']}")
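The automatic pagination that list performs can be approximated by a cursor-following loop. The sketch below is illustrative only: `list_all`, `fetch_page`, and the `data` / `next_cursor` response shape are assumptions, and an in-memory dict stands in for the HTTP endpoint.

```python
from typing import Any, Callable, Dict, List, Optional

Page = Dict[str, Any]


def list_all(fetch_page: Callable[[Optional[str]], Page]) -> List[Dict[str, Any]]:
    """Follow cursors until the server reports that no further page exists."""
    items: List[Dict[str, Any]] = []
    cursor: Optional[str] = None
    while True:
        page = fetch_page(cursor)
        items.extend(page["data"])
        cursor = page.get("next_cursor")
        if cursor is None:
            return items


# In-memory stand-in for the paginated endpoint, keyed by cursor.
FAKE_PAGES: Dict[Optional[str], Page] = {
    None: {"data": [{"id": "exp_1"}, {"id": "exp_2"}], "next_cursor": "c1"},
    "c1": {"data": [{"id": "exp_3"}], "next_cursor": None},
}

experiments = list_all(lambda cursor: FAKE_PAGES[cursor])
print([experiment["id"] for experiment in experiments])
```

Because the timeout applies to each paginated request, a long listing may take several multiples of it in total.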
- resume_evaluation(*, experiment_id, evaluators, print_summary=True, timeout=DEFAULT_TIMEOUT_IN_SECONDS, rate_limit_errors=None, retries=3)
Resume incomplete evaluations for an experiment.
This method identifies which evaluations have not been completed (either missing or failed) and runs the evaluators only for those runs. This is useful for:
- Recovering from transient evaluator failures
- Adding new evaluators to completed experiments
- Completing partially evaluated experiments
The method processes incomplete evaluations in batches using pagination to minimize memory usage.
Evaluation names are matched to evaluator dict keys. For example, if you pass {"accuracy": accuracy_fn}, it will check for and resume any runs missing the “accuracy” evaluation.
Note
Multi-output evaluators (evaluators that return a list/sequence of results) are not supported for resume operations. Each evaluator should produce a single evaluation result with a name matching the evaluator’s key in the dictionary.
- Parameters:
experiment_id (str) – The ID of the experiment to resume evaluations for.
evaluators (ExperimentEvaluators) – A single evaluator or sequence of evaluators to run. Evaluators can be provided as a dict mapping names to functions, or as a list of functions (names will be auto-generated).
print_summary (bool) – Whether to print a summary of evaluation results. Defaults to True.
timeout (Optional[int]) – The timeout for evaluation execution in seconds. Defaults to 60.
rate_limit_errors (Optional[RateLimitErrors]) – An exception or sequence of exceptions to adaptively throttle on. Defaults to None.
retries (int) – The number of times to retry a task if it fails. Defaults to 3.
- Raises:
ValueError – If the experiment is not found or no evaluators are provided.
httpx.HTTPStatusError – If the API returns an error response.
Example:
from phoenix.client import Client

client = Client()

def accuracy(output, expected):
    return 1.0 if output == expected else 0.0

# Standard usage: evaluation name matches evaluator key
client.experiments.resume_evaluation(
    experiment_id="exp_123",
    evaluators={"accuracy": accuracy},
)
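Matching evaluation names to evaluator keys amounts to a set difference over (run, evaluation name) pairs. The following sketch shows that idea with hypothetical names and plain tuples; it is not how the client stores or queries evaluations.

```python
from typing import Iterable, List, Set, Tuple


def find_missing_evaluations(
    run_ids: Iterable[str],
    evaluator_names: Iterable[str],
    completed: Set[Tuple[str, str]],
) -> List[Tuple[str, str]]:
    """Return (run_id, evaluation_name) pairs that still need an evaluator run.

    `completed` holds the pairs that already have a successful evaluation.
    Failed evaluations are left out of it, so they are picked up again.
    """
    names = list(evaluator_names)
    return [
        (run_id, name)
        for run_id in run_ids
        for name in names
        if (run_id, name) not in completed
    ]


completed = {("run_1", "accuracy"), ("run_2", "accuracy"), ("run_1", "fluency")}
todo = find_missing_evaluations(
    ["run_1", "run_2", "run_3"],
    ["accuracy", "fluency"],
    completed,
)
print(todo)
```

Adding a new key to the evaluators dict therefore schedules that evaluation for every run, since no run has a completed result under the new name.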
- resume_experiment(*, experiment_id, task, evaluators=None, print_summary=True, timeout=DEFAULT_TIMEOUT_IN_SECONDS, rate_limit_errors=None, retries=3)
Resume an incomplete experiment by running only the missing or failed runs.
This method identifies which (example, repetition) pairs have not been completed (either missing or failed) and re-runs the task only for those pairs. Optionally, evaluators can be run on the completed runs after task execution.
The method processes incomplete runs in batches using pagination to minimize memory usage.
Note
Multi-output evaluators (evaluators that return a list/sequence of results) are not supported for resume operations. Each evaluator should produce a single evaluation result with a name matching the evaluator’s key in the dictionary.
- Parameters:
experiment_id (str) – The ID of the experiment to resume.
task (ExperimentTask) – The task to run on incomplete examples.
evaluators (Optional[ExperimentEvaluators]) – Optional evaluators to run on completed task runs. Evaluators can be provided as a dict mapping names to functions, or as a list of functions (names will be auto-generated). Defaults to None.
print_summary (bool) – Whether to print a summary of the results. Defaults to True.
timeout (Optional[int]) – The timeout for task execution in seconds. Defaults to 60.
rate_limit_errors (Optional[RateLimitErrors]) – An exception or sequence of exceptions to adaptively throttle on. Defaults to None.
retries (int) – The number of times to retry a task if it fails. Defaults to 3.
- Returns:
None
- Raises:
ValueError – If the experiment is not found.
httpx.HTTPStatusError – If the API returns an error response.
Example:
from phoenix.client import Client

client = Client()

# Resume an interrupted experiment
client.experiments.resume_experiment(
    experiment_id="exp_123",
    task=my_task,
)

# Resume with evaluators
client.experiments.resume_experiment(
    experiment_id="exp_123",
    task=my_task,
    evaluators={"quality": my_evaluator},
)
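To make the notion of incomplete (example, repetition) pairs concrete, the sketch below computes them from a list of recorded runs. The record fields (`example_id`, `repetition_number`, `error`) and the 1-based repetition numbering are assumptions made for illustration.

```python
from itertools import product
from typing import Any, Dict, Iterable, List, Sequence, Tuple


def find_incomplete_runs(
    example_ids: Sequence[str],
    repetitions: int,
    runs: Iterable[Dict[str, Any]],
) -> List[Tuple[str, int]]:
    """Return (example_id, repetition) pairs that are missing or failed."""
    succeeded = {
        (run["example_id"], run["repetition_number"])
        for run in runs
        if run.get("error") is None  # a run with an error counts as incomplete
    }
    expected_pairs = product(example_ids, range(1, repetitions + 1))
    return [pair for pair in expected_pairs if pair not in succeeded]


runs = [
    {"example_id": "ex_1", "repetition_number": 1, "error": None},
    {"example_id": "ex_1", "repetition_number": 2, "error": "timeout"},
    {"example_id": "ex_2", "repetition_number": 1, "error": None},
]
pending = find_incomplete_runs(["ex_1", "ex_2"], 2, runs)
print(pending)
```

Here the failed second repetition of `ex_1` and the never-started second repetition of `ex_2` are both scheduled, while successful runs are left alone.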
- run_experiment(*, dataset, task, evaluators=None, experiment_name=None, experiment_description=None, experiment_metadata=None, rate_limit_errors=None, dry_run=False, print_summary=True, timeout=DEFAULT_TIMEOUT_IN_SECONDS, repetitions=1, retries=3)
Runs an experiment using a given dataset of examples.
An experiment is a user-defined task that runs on each example in a dataset. The results from each experiment can be evaluated using any number of evaluators to measure the behavior of the task. The experiment and evaluation results are stored in the Phoenix database for comparison and analysis.
A task is a synchronous function that returns a JSON serializable output. If the task is a function of one argument then that argument will be bound to the input field of the dataset example. Alternatively, the task can be a function of any combination of specific argument names that will be bound to special values:
input: The input field of the dataset example
expected: The expected or reference output of the dataset example
reference: An alias for expected
metadata: Metadata associated with the dataset example
example: The dataset Example object with all associated fields
An evaluator is a synchronous function that returns an evaluation result object, which can take any of the following forms:
an EvaluationResult dict with optional fields for score, label, explanation and metadata
a bool, which will be interpreted as a score of 0 or 1 plus a label of “True” or “False”
a float, which will be interpreted as a score
a str, which will be interpreted as a label
a 2-tuple of (float, str), which will be interpreted as (score, explanation)
If the evaluator is a function of one argument then that argument will be bound to the output of the task. Alternatively, the evaluator can be a function of any combination of specific argument names that will be bound to special values:
input: The input field of the dataset example
output: The output of the task
expected: The expected or reference output of the dataset example
reference: An alias for expected
metadata: Metadata associated with the dataset example
example: The dataset Example object with all associated fields
- Parameters:
dataset (Dataset) – The dataset on which to run the experiment.
task (ExperimentTask) – The task to run on each example in the dataset.
evaluators (Optional[ExperimentEvaluators]) – A single evaluator or sequence of evaluators used to evaluate the results of the experiment. Evaluators can be provided as a dict mapping names to functions, or as a list of functions (names will be auto-generated). Defaults to None.
experiment_name (Optional[str]) – The name of the experiment. Defaults to None.
experiment_description (Optional[str]) – A description of the experiment. Defaults to None.
experiment_metadata (Optional[Mapping[str, Any]]) – Metadata to associate with the experiment. Defaults to None.
rate_limit_errors (Optional[RateLimitErrors]) – An exception or sequence of exceptions to adaptively throttle on. Defaults to None.
dry_run (Union[bool, int]) – Run the experiment in dry-run mode. When set, experiment results will not be recorded in Phoenix. If True, the experiment will run on a random dataset example. If an integer, the experiment will run on a random sample of the dataset examples of the given size. Defaults to False.
print_summary (bool) – Whether to print a summary of the experiment and evaluation results. Defaults to True.
timeout (Optional[int]) – The timeout for the task execution in seconds. Increase this for longer-running tasks to avoid re-queuing the same task multiple times. Defaults to 60.
repetitions (int) – The number of times the task will be run on each example. Defaults to 1.
retries (int) – The number of times to retry a task if it fails. Defaults to 3.
- Returns:
A dictionary containing the experiment results.
- Return type:
RanExperiment
- Raises:
ValueError – If the dataset format is invalid or the dataset has no examples.
httpx.HTTPStatusError – If the API returns an error response.
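The dry_run parameter accepts either a bool or an int. A minimal sketch of the selection it describes is shown below. `select_examples` and its `seed` argument are hypothetical, and capping the sample at the dataset size is an assumption.

```python
import random
from typing import List, Optional, Sequence, TypeVar, Union

T = TypeVar("T")


def select_examples(
    examples: Sequence[T],
    dry_run: Union[bool, int] = False,
    seed: Optional[int] = None,
) -> List[T]:
    """Pick the examples a run would use for a given dry_run setting."""
    if not dry_run:
        return list(examples)  # normal run: every example is used
    # bool is a subclass of int, so True is handled before plain integers.
    size = 1 if dry_run is True else int(dry_run)
    size = min(size, len(examples))  # assumption: never ask for more than exists
    return random.Random(seed).sample(list(examples), size)


data = list(range(10))
print(len(select_examples(data, dry_run=True)))
print(len(select_examples(data, dry_run=3)))
print(len(select_examples(data, dry_run=False)))
```

Using `dry_run is True` instead of `dry_run == True` keeps `dry_run=1` on the integer path, although both select a single example.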
- class AsyncExperiments(client, *, _guard=None)
Bases: object
Provides async methods for running experiments and evaluations.
An experiment is a user-defined task that runs on each example in a dataset. The results from each experiment can be evaluated using any number of evaluators to measure the behavior of the task. The experiment and evaluation results are stored in the Phoenix database for comparison and analysis.
A task is either a synchronous or asynchronous function that returns a JSON serializable output. If the task is a function of one argument then that argument will be bound to the input field of the dataset example. Alternatively, the task can be a function of any combination of specific argument names that will be bound to special values:
input: The input field of the dataset example
expected: The expected or reference output of the dataset example
reference: An alias for expected
metadata: Metadata associated with the dataset example
example: The dataset Example object with all associated fields
An evaluator is either a synchronous or asynchronous function that returns an evaluation result object, which can take any of the following forms:
a phoenix.experiments.types.EvaluationResult with optional fields for score, label, explanation and metadata
a bool, which will be interpreted as a score of 0 or 1 plus a label of “True” or “False”
a float, which will be interpreted as a score
a str, which will be interpreted as a label
a 2-tuple of (float, str), which will be interpreted as (score, explanation)
If the evaluator is a function of one argument then that argument will be bound to the output of the task. Alternatively, the evaluator can be a function of any combination of specific argument names that will be bound to special values:
input: The input field of the dataset example
output: The output of the task
expected: The expected or reference output of the dataset example
reference: An alias for expected
metadata: Metadata associated with the dataset example
example: The dataset Example object with all associated fields
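Since tasks and evaluators may be either synchronous or asynchronous, a caller has to handle both kinds of callable. One common pattern, sketched here with the hypothetical helper `call_task`, is to call the function and await the result only if it is awaitable. This is not necessarily how the async client dispatches them.

```python
import asyncio
import inspect
from typing import Any, Callable


async def call_task(task: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
    """Call a sync or async task and return its final (non-awaitable) result."""
    result = task(*args, **kwargs)
    if inspect.isawaitable(result):
        result = await result  # coroutine functions return an awaitable
    return result


def sync_task(input):
    return f"Hello {input['name']}"


async def async_task(input):
    await asyncio.sleep(0)  # placeholder for real asynchronous work
    return f"Hello {input['name']}"


async def main():
    example_input = {"name": "Ada"}
    return [
        await call_task(sync_task, example_input),
        await call_task(async_task, example_input),
    ]


outputs = asyncio.run(main())
print(outputs)
```

Note that a synchronous task called this way runs on the event loop thread and blocks it until it returns, so long-running blocking work is better written as an async function.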
Example
Basic usage:
from phoenix.client import AsyncClient

client = AsyncClient()
dataset = await client.datasets.get_dataset(dataset="my-dataset")

async def my_task(input):
    return f"Hello {input['name']}"

experiment = await client.experiments.run_experiment(
    dataset=dataset,
    task=my_task,
    experiment_name="greeting-experiment"
)
With evaluators:
async def accuracy_evaluator(output, expected):
    return 1.0 if output == expected['text'] else 0.0

experiment = await client.experiments.run_experiment(
    dataset=dataset,
    task=my_task,
    evaluators=[accuracy_evaluator],
    experiment_name="evaluated-experiment"
)
Using dynamic binding for tasks:
async def my_task(input, metadata, expected):
    # Task can access multiple fields from the dataset example
    context = metadata.get("context", "")
    return f"Context: {context}, Input: {input}, Expected: {expected}"
Using dynamic binding for evaluators:
async def my_evaluator(output, input, expected, metadata):
    # Evaluator can access task output and example fields
    # calculate_similarity stands in for a user-defined async similarity function
    score = await calculate_similarity(output, expected)
    return {"score": score, "label": "pass" if score > 0.8 else "fail"}
- async create(*, dataset_id, dataset_version_id=None, experiment_name=None, experiment_description=None, experiment_metadata=None, splits=None, repetitions=1, timeout=DEFAULT_TIMEOUT_IN_SECONDS)
Create a new experiment without running it (async version).
This method creates an experiment record in the Phoenix database but does not execute any tasks. Use resume_experiment to run tasks on the created experiment.
- Parameters:
dataset_id (str) – The ID of the dataset on which the experiment will be run.
dataset_version_id (Optional[str]) – The ID of the dataset version to use. If not provided, the latest version will be used. Defaults to None.
experiment_name (Optional[str]) – The name of the experiment. Defaults to None.
experiment_description (Optional[str]) – A description of the experiment. Defaults to None.
experiment_metadata (Optional[Mapping[str, Any]]) – Metadata to associate with the experiment. Defaults to None.
splits (Optional[Sequence[str]]) – List of dataset split identifiers (IDs or names) to filter by. Defaults to None.
repetitions (int) – The number of times the task will be run on each example. Defaults to 1.
timeout (Optional[int]) – The timeout for the request in seconds. Defaults to 60.
- Returns:
The newly created experiment.
- Return type:
Experiment
- Raises:
httpx.HTTPStatusError – If the API returns an error response.
Example:
from phoenix.client import AsyncClient

async_client = AsyncClient()
experiment = await async_client.experiments.create(
    dataset_id="dataset_123",
    experiment_name="my-experiment",
    experiment_description="Testing my task",
    repetitions=3,
)
print(f"Created experiment with ID: {experiment['id']}")

# Later, run the experiment (my_task is a task function defined elsewhere)
await async_client.experiments.resume_experiment(
    experiment_id=experiment["id"],
    task=my_task,
)
- async delete(*, experiment_id, delete_project=False)
Delete an experiment by ID.
- Parameters:
experiment_id (str) – The ID of the experiment to delete.
delete_project (bool) – If True, also delete the project associated with the experiment. Defaults to False.
- Raises:
httpx.HTTPError – If the request fails.
ValueError – If the experiment is not found.
Example:
from phoenix.client import AsyncClient

async_client = AsyncClient()
await async_client.experiments.delete(experiment_id="exp_123")
- async evaluate_experiment(*, experiment, evaluators, dry_run=False, print_summary=True, timeout=DEFAULT_TIMEOUT_IN_SECONDS, concurrency=3, rate_limit_errors=None, retries=3)
Run evaluators on a completed experiment.
An evaluator is either a synchronous or asynchronous function that returns an evaluation result object, which can take any of the following forms:
an EvaluationResult dict with optional fields for score, label, explanation and metadata
a bool, which will be interpreted as a score of 0 or 1 plus a label of “True” or “False”
a float, which will be interpreted as a score
a str, which will be interpreted as a label
a 2-tuple of (float, str), which will be interpreted as (score, explanation)
- Parameters:
experiment (RanExperiment) – The experiment to evaluate, returned from run_experiment.
evaluators (ExperimentEvaluators) – A single evaluator or sequence of evaluators used to evaluate the results of the experiment. Evaluators can be provided as a dict mapping names to functions, or as a list of functions (names will be auto-generated).
dry_run (bool) – Run the evaluation in dry-run mode. When set, evaluation results will not be recorded in Phoenix. Defaults to False.
print_summary (bool) – Whether to print a summary of the evaluation results. Defaults to True.
timeout (Optional[int]) – The timeout for the evaluation execution in seconds. Defaults to 60.
concurrency (int) – Specifies the concurrency for evaluation execution. Defaults to 3.
rate_limit_errors (Optional[RateLimitErrors]) – An exception or sequence of exceptions to adaptively throttle on. Defaults to None.
retries (int) – The number of times to retry a task if it fails. Defaults to 3.
- Returns:
A dictionary containing the evaluation results, in the same format as run_experiment.
- Return type:
RanExperiment
- Raises:
ValueError – If no evaluators are provided or the experiment has no runs.
httpx.HTTPStatusError – If the API returns an error response.
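The concurrency parameter bounds how many evaluations are in flight at once. A typical way to enforce such a bound in asyncio is a semaphore, sketched below. `run_bounded` and `fake_evaluation` are hypothetical, and the client's actual scheduling may differ.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, Iterable, List


async def run_bounded(
    jobs: Iterable[Any],
    worker: Callable[[Any], Awaitable[Any]],
    concurrency: int = 3,
) -> List[Any]:
    """Run `worker` over `jobs`, allowing at most `concurrency` at a time."""
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(job: Any) -> Any:
        async with semaphore:  # waits here while the limit is reached
            return await worker(job)

    # gather returns results in the same order as the jobs were given.
    return await asyncio.gather(*(guarded(job) for job in jobs))


# Bookkeeping used only to observe how many workers overlap.
state: Dict[str, int] = {"active": 0, "peak": 0}


async def fake_evaluation(run_id: int) -> int:
    state["active"] += 1
    state["peak"] = max(state["peak"], state["active"])
    await asyncio.sleep(0.01)  # placeholder for an evaluator call
    state["active"] -= 1
    return run_id * 2


results = asyncio.run(run_bounded(range(7), fake_evaluation, concurrency=3))
print(results, state["peak"])
```

Raising the limit shortens wall-clock time for I/O-bound evaluators but also makes it more likely that a rate limit is hit.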
- async get(*, experiment_id)
Get an experiment by ID.
- Parameters:
experiment_id (str) – The ID of the experiment to retrieve.
- Returns:
The experiment with the specified ID.
- Return type:
Experiment
- Raises:
httpx.HTTPError – If the request fails.
ValueError – If the experiment is not found.
Example:
from phoenix.client import AsyncClient

async_client = AsyncClient()
experiment = await async_client.experiments.get(experiment_id="exp_123")
print(f"Example count: {experiment['example_count']}")
print(f"Successful runs: {experiment['successful_run_count']}")
- get_dataset_experiments_url(dataset_id)
- async get_experiment(*, experiment_id)
Get a completed experiment by ID (async version).
This method retrieves a completed experiment with all its task runs and evaluation runs, returning a RanExperiment object that can be used with evaluate_experiment to run additional evaluations.
- Parameters:
experiment_id (str) – The ID of the experiment to retrieve.
- Returns:
A RanExperiment object containing the experiment data, task runs, and evaluation runs.
- Return type:
RanExperiment
- Raises:
ValueError – If the experiment is not found.
httpx.HTTPStatusError – If the API returns an error response.
Example:
from phoenix.client import AsyncClient

client = AsyncClient()
experiment = await client.experiments.get_experiment(experiment_id="123")
await client.experiments.evaluate_experiment(
    experiment=experiment,
    evaluators=[
        correctness,  # an evaluator function defined elsewhere
    ],
    print_summary=True,
)
- get_experiment_url(dataset_id, experiment_id)
- async list(*, dataset_id, timeout=DEFAULT_TIMEOUT_IN_SECONDS)
List all experiments for a dataset with automatic pagination handling.
This method automatically handles pagination behind the scenes and returns a simple list of experiments.
- Parameters:
dataset_id (str) – The ID of the dataset to list experiments for.
timeout – The request timeout in seconds for each paginated request. Defaults to 60.
- Returns:
A list of all experiments for the dataset.
- Return type:
list[Experiment]
- Raises:
httpx.HTTPError – If the request fails.
Example:
from phoenix.client import AsyncClient

async_client = AsyncClient()
experiments = await async_client.experiments.list(dataset_id="dataset_123")
for experiment in experiments:
    print(f"Experiment: {experiment['id']}, Runs: {experiment['successful_run_count']}")
- async resume_evaluation(*, experiment_id, evaluators, print_summary=True, timeout=DEFAULT_TIMEOUT_IN_SECONDS, concurrency=3, rate_limit_errors=None, retries=3)
Resume incomplete evaluations for an experiment (async version).
This method identifies which evaluations have not been completed (either missing or failed) and runs the evaluators only for those runs. This is useful for:
- Recovering from transient evaluator failures
- Adding new evaluators to completed experiments
- Completing partially evaluated experiments
The method processes incomplete evaluations in batches using pagination to minimize memory usage.
Evaluation names are matched to evaluator dict keys. For example, if you pass {"accuracy": accuracy_fn}, it will check for and resume any runs missing the “accuracy” evaluation.
Note
Multi-output evaluators (evaluators that return a list/sequence of results) are not supported for resume operations. Each evaluator should produce a single evaluation result with a name matching the evaluator’s key in the dictionary.
- Parameters:
experiment_id (str) – The ID of the experiment to resume evaluations for.
evaluators (ExperimentEvaluators) – A single evaluator or sequence of evaluators to run. Evaluators can be provided as a dict mapping names to functions, or as a list of functions (names will be auto-generated).
print_summary (bool) – Whether to print a summary of evaluation results. Defaults to True.
timeout (Optional[int]) – The timeout for evaluation execution in seconds. Defaults to 60.
concurrency (int) – The number of concurrent evaluations to run. Defaults to 3.
rate_limit_errors (Optional[RateLimitErrors]) – An exception or sequence of exceptions to adaptively throttle on. Defaults to None.
retries (int) – The number of times to retry a task if it fails. Defaults to 3.
- Raises:
ValueError – If the experiment is not found or no evaluators are provided.
httpx.HTTPStatusError – If the API returns an error response.
Example:
from phoenix.client import AsyncClient

client = AsyncClient()

async def accuracy(output, expected):
    return 1.0 if output == expected else 0.0

# Standard usage: evaluation name matches evaluator key
await client.experiments.resume_evaluation(
    experiment_id="exp_123",
    evaluators={"accuracy": accuracy},
)
- async resume_experiment(*, experiment_id, task, evaluators=None, print_summary=True, timeout=DEFAULT_TIMEOUT_IN_SECONDS, concurrency=3, rate_limit_errors=None, retries=3)
Resume an incomplete experiment by running only the missing or failed runs.
This method identifies which (example, repetition) pairs have not been completed (either missing or failed) and re-runs the task only for those pairs. Optionally, evaluators can be run on the completed runs after task execution.
The method processes incomplete runs in batches using pagination to minimize memory usage.
Note
Multi-output evaluators (evaluators that return a list/sequence of results) are not supported for resume operations. Each evaluator should produce a single evaluation result with a name matching the evaluator’s key in the dictionary.
- Parameters:
experiment_id (str) – The ID of the experiment to resume.
task (ExperimentTask) – The task to run on incomplete examples.
evaluators (Optional[ExperimentEvaluators]) – Optional evaluators to run on completed task runs. Evaluators can be provided as a dict mapping names to functions, or as a list of functions (names will be auto-generated). Defaults to None.
print_summary (bool) – Whether to print a summary of the results. Defaults to True.
timeout (Optional[int]) – The timeout for task execution in seconds. Defaults to 60.
concurrency (int) – The number of concurrent tasks to run. Defaults to 3.
rate_limit_errors (Optional[RateLimitErrors]) – An exception or sequence of exceptions to adaptively throttle on. Defaults to None.
retries (int) – The number of times to retry a task if it fails. Defaults to 3.
- Returns:
None
- Raises:
ValueError – If the experiment is not found.
httpx.HTTPStatusError – If the API returns an error response.
Example:
from phoenix.client import AsyncClient

client = AsyncClient()

# Resume an interrupted experiment
await client.experiments.resume_experiment(
    experiment_id="exp_123",
    task=my_async_task,
)

# Resume with evaluators
await client.experiments.resume_experiment(
    experiment_id="exp_123",
    task=my_async_task,
    evaluators={"quality": my_evaluator},
)
- async run_experiment(*, dataset, task, evaluators=None, experiment_name=None, experiment_description=None, experiment_metadata=None, rate_limit_errors=None, dry_run=False, print_summary=True, concurrency=3, timeout=DEFAULT_TIMEOUT_IN_SECONDS, repetitions=1, retries=3)
Runs an experiment using a given dataset of examples (async version).
An experiment is a user-defined task that runs on each example in a dataset. The results from each experiment can be evaluated using any number of evaluators to measure the behavior of the task. The experiment and evaluation results are stored in the Phoenix database for comparison and analysis.
A task is either a synchronous or asynchronous function that returns a JSON serializable output. If the task is a function of one argument then that argument will be bound to the input field of the dataset example. Alternatively, the task can be a function of any combination of specific argument names that will be bound to special values:
input: The input field of the dataset example
expected: The expected or reference output of the dataset example
reference: An alias for expected
metadata: Metadata associated with the dataset example
example: The dataset Example object with all associated fields
An evaluator is either a synchronous or asynchronous function that returns an evaluation result object, which can take any of the following forms:
an EvaluationResult dict with optional fields for score, label, explanation and metadata
a bool, which will be interpreted as a score of 0 or 1 plus a label of “True” or “False”
a float, which will be interpreted as a score
a str, which will be interpreted as a label
a 2-tuple of (float, str), which will be interpreted as (score, explanation)
If the evaluator is a function of one argument then that argument will be bound to the output of the task. Alternatively, the evaluator can be a function of any combination of specific argument names that will be bound to special values:
input: The input field of the dataset example
output: The output of the task
expected: The expected or reference output of the dataset example
reference: An alias for expected
metadata: Metadata associated with the dataset example
example: The dataset Example object with all associated fields
- Parameters:
dataset (Dataset) – The dataset on which to run the experiment.
task (ExperimentTask) – The task to run on each example in the dataset.
evaluators (Optional[ExperimentEvaluators]) – A single evaluator or sequence of evaluators used to evaluate the results of the experiment. Evaluators can be provided as a dict mapping names to functions, or as a list of functions (names will be auto-generated). Defaults to None.
experiment_name (Optional[str]) – The name of the experiment. Defaults to None.
experiment_description (Optional[str]) – A description of the experiment. Defaults to None.
experiment_metadata (Optional[Mapping[str, Any]]) – Metadata to associate with the experiment. Defaults to None.
rate_limit_errors (Optional[RateLimitErrors]) – An exception or sequence of exceptions to adaptively throttle on. Defaults to None.
dry_run (Union[bool, int]) – Run the experiment in dry-run mode. When set, experiment results will not be recorded in Phoenix. If True, the experiment will run on a random dataset example. If an integer, the experiment will run on a random sample of the dataset examples of the given size. Defaults to False.
print_summary (bool) – Whether to print a summary of the experiment and evaluation results. Defaults to True.
concurrency (int) – Specifies the concurrency for task execution. Defaults to 3.
timeout (Optional[int]) – The timeout for the task execution in seconds. Increase this for longer-running tasks to avoid re-queuing the same task multiple times. Defaults to 60.
repetitions (int) – The number of times the task will be run on each example. Defaults to 1.
retries (int) – The number of times to retry a task if it fails. Defaults to 3.
- Returns:
A dictionary containing the experiment results.
- Return type:
RanExperiment
- Raises:
ValueError – If the dataset format is invalid or the dataset has no examples.
httpx.HTTPStatusError – If the API returns an error response.
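The retries and rate_limit_errors parameters can be pictured together as a retry loop that slows down when a rate-limit exception is seen. The sketch below substitutes simple exponential backoff for the adaptive throttling the client performs. `call_with_retries`, `initial_delay`, and the reading of retries as "attempts after the first" are all assumptions.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], T],
    *,
    retries: int = 3,
    rate_limit_errors: Tuple[Type[BaseException], ...] = (),
    initial_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fn`, retrying on failure and backing off on rate-limit errors."""
    if retries < 0:
        raise ValueError("retries must be non-negative")
    delay = initial_delay
    for attempt in range(retries + 1):  # assumption: first call plus `retries` retries
        try:
            return fn()
        except rate_limit_errors:
            if attempt == retries:
                raise
            sleep(delay)  # wait before retrying a throttled call
            delay *= 2  # exponential backoff, a stand-in for adaptive throttling
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the last error
    raise AssertionError("unreachable")


class RateLimited(Exception):
    """Hypothetical rate-limit error raised by the fake service below."""


calls = {"count": 0}
waits = []


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RateLimited()
    return "ok"


result = call_with_retries(
    flaky,
    retries=3,
    rate_limit_errors=(RateLimited,),
    initial_delay=0.1,
    sleep=waits.append,  # record the waits instead of sleeping
)
print(result, calls["count"], waits)
```

Injecting `sleep` keeps the sketch fast to run; production code would leave the default so that the waits actually happen.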