Experiments

class Experiments(client, *, _guard=None)

Bases: object

Provides methods for running experiments and evaluations.

An experiment is a user-defined task that runs on each example in a dataset. The results from each experiment can be evaluated using any number of evaluators to measure the behavior of the task. The experiment and evaluation results are stored in the Phoenix database for comparison and analysis.

A task is either a synchronous or asynchronous function that returns a JSON serializable output. If the task is a function of one argument then that argument will be bound to the input field of the dataset example. Alternatively, the task can be a function of any combination of specific argument names that will be bound to special values:

input: The input field of the dataset example
expected: The expected or reference output of the dataset example
reference: An alias for expected
metadata: Metadata associated with the dataset example
example: The dataset Example object with all associated fields
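The binding rule above can be sketched with `inspect.signature`. This is a minimal illustration, not the client's actual implementation: `bind_task_arguments` is a hypothetical helper, and the example is modelled as a plain dict whose `output` key holds the expected value (an assumption made for this sketch).

```python
import inspect
from typing import Any, Callable, Dict, Mapping


# Hypothetical helper for illustration; not part of the Phoenix client API.
def bind_task_arguments(task: Callable[..., Any], example: Mapping[str, Any]) -> Dict[str, Any]:
    """Choose keyword arguments for `task` based on its parameter names."""
    available = {
        "input": example["input"],
        "expected": example["output"],   # assumed key for the reference output
        "reference": example["output"],  # alias for expected
        "metadata": example["metadata"],
        "example": example,
    }
    params = list(inspect.signature(task).parameters)
    if len(params) == 1:
        # A one-argument task receives the example's input field.
        return {params[0]: example["input"]}
    unknown = [name for name in params if name not in available]
    if unknown:
        raise ValueError(f"unsupported task parameter(s): {unknown}")
    return {name: available[name] for name in params}


def greet(input):
    return f"Hello {input['name']}"


def describe(input, metadata):
    return f"{input['name']} ({metadata['source']})"


example = {
    "input": {"name": "Ada"},
    "output": {"text": "Hello Ada"},
    "metadata": {"source": "unit-test"},
}

greeting = greet(**bind_task_arguments(greet, example))
description = describe(**bind_task_arguments(describe, example))
```

The real client may resolve edge cases differently, for example a single parameter that happens to use one of the special names.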

An evaluator is either a synchronous or asynchronous function that returns an evaluation result object, which can take any of the following forms:

an EvaluationResult dict with optional fields for score, label, explanation and metadata
a bool, which will be interpreted as a score of 0 or 1 plus a label of “True” or “False”
a float, which will be interpreted as a score
a str, which will be interpreted as a label
a 2-tuple of (float, str), which will be interpreted as (score, explanation)
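As a rough sketch of how the return forms listed above could be mapped onto one shape, the hypothetical `normalize_evaluation` helper below converts each form into a dict. It is not the client's own code, and accepting a plain `int` as a score is a leniency assumed here.

```python
from typing import Any, Dict


# Hypothetical helper for illustration; not part of the Phoenix client API.
def normalize_evaluation(result: Any) -> Dict[str, Any]:
    """Map each accepted evaluator return form onto score/label/explanation."""
    if isinstance(result, dict):
        return dict(result)  # already an EvaluationResult-style dict
    if isinstance(result, bool):
        # bool is checked before numbers because bool subclasses int.
        return {"score": float(result), "label": str(result)}
    if isinstance(result, (int, float)):
        return {"score": float(result)}
    if isinstance(result, str):
        return {"label": result}
    if isinstance(result, tuple) and len(result) == 2:
        score, explanation = result
        return {"score": float(score), "explanation": str(explanation)}
    raise TypeError(f"unsupported evaluation result: {result!r}")


from_bool = normalize_evaluation(True)
from_float = normalize_evaluation(0.5)
from_str = normalize_evaluation("pass")
from_tuple = normalize_evaluation((0.9, "close match"))
```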

If the evaluator is a function of one argument then that argument will be bound to the output of the task. Alternatively, the evaluator can be a function of any combination of specific argument names that will be bound to special values:

input: The input field of the dataset example
output: The output of the task
expected: The expected or reference output of the dataset example
reference: An alias for expected
metadata: Metadata associated with the dataset example
example: The dataset Example object with all associated fields

Example

Basic usage:

from phoenix.client import Client
client = Client()
dataset = client.datasets.get_dataset(dataset="my-dataset")

def my_task(input):
    return f"Hello {input['name']}"

experiment = client.experiments.run_experiment(
    dataset=dataset,
    task=my_task,
    experiment_name="greeting-experiment"
)

With evaluators:

def accuracy_evaluator(output, expected):
    return 1.0 if output == expected['text'] else 0.0

experiment = client.experiments.run_experiment(
    dataset=dataset,
    task=my_task,
    evaluators=[accuracy_evaluator],
    experiment_name="evaluated-experiment"
)

Using dynamic binding for tasks:

def my_task(input, metadata, expected):
    # Task can access multiple fields from the dataset example
    context = metadata.get("context", "")
    return f"Context: {context}, Input: {input}, Expected: {expected}"

Using dynamic binding for evaluators:

def my_evaluator(output, input, expected, metadata):
    # Evaluator can access task output and example fields
    # calculate_similarity is a placeholder for a user-supplied scoring function
    score = calculate_similarity(output, expected)
    return {"score": score, "label": "pass" if score > 0.8 else "fail"}

create(*, dataset_id, dataset_version_id=None, experiment_name=None, experiment_description=None, experiment_metadata=None, splits=None, repetitions=1, timeout=DEFAULT_TIMEOUT_IN_SECONDS)

Create a new experiment without running it.

This method creates an experiment record in the Phoenix database but does not execute any tasks. Use resume_experiment to run tasks on the created experiment.

Parameters:

dataset_id (str) – The ID of the dataset on which the experiment will be run.
dataset_version_id (Optional[str]) – The ID of the dataset version to use. If not provided, the latest version will be used. Defaults to None.
experiment_name (Optional[str]) – The name of the experiment. Defaults to None.
experiment_description (Optional[str]) – A description of the experiment. Defaults to None.
experiment_metadata (Optional[Mapping[str, Any]]) – Metadata to associate with the experiment. Defaults to None.
splits (Optional[Sequence[str]]) – List of dataset split identifiers (IDs or names) to filter by. Defaults to None.
repetitions (int) – The number of times the task will be run on each example. Defaults to 1.
timeout (Optional[int]) – The timeout for the request in seconds. Defaults to 60.

Returns:

The newly created experiment.

Return type:

Experiment

Raises:

httpx.HTTPStatusError – If the API returns an error response.

Example:

from phoenix.client import Client
client = Client()

experiment = client.experiments.create(
    dataset_id="dataset_123",
    experiment_name="my-experiment",
    experiment_description="Testing my task",
    repetitions=3,
)
print(f"Created experiment with ID: {experiment['id']}")

client.experiments.resume_experiment(
    experiment_id=experiment["id"],
    task=my_task,
)

delete(*, experiment_id, delete_project=False)

Delete an experiment by ID.

Parameters:

experiment_id (str) – The ID of the experiment to delete.
delete_project (bool) – If True, also delete the project associated with the experiment. Defaults to False.

Raises:

httpx.HTTPError – If the request fails.
ValueError – If the experiment is not found.

Example:

from phoenix.client import Client
client = Client()

client.experiments.delete(experiment_id="exp_123")

evaluate_experiment(*, experiment, evaluators, dry_run=False, print_summary=True, timeout=DEFAULT_TIMEOUT_IN_SECONDS, rate_limit_errors=None, retries=3)

Run evaluators on a completed experiment.

An evaluator is a synchronous function that returns an evaluation result object, which can take any of the following forms:

an EvaluationResult dict with optional fields for score, label, explanation and metadata
a bool, which will be interpreted as a score of 0 or 1 plus a label of “True” or “False”
a float, which will be interpreted as a score
a str, which will be interpreted as a label
a 2-tuple of (float, str), which will be interpreted as (score, explanation)

Parameters:

experiment (RanExperiment) – The experiment to evaluate, returned from run_experiment.
evaluators (ExperimentEvaluators) – A single evaluator or sequence of evaluators used to evaluate the results of the experiment. Evaluators can be provided as a dict mapping names to functions, or as a list of functions (names will be auto-generated).
dry_run (bool) – Run the evaluation in dry-run mode. When set, evaluation results will not be recorded in Phoenix. Defaults to False.
print_summary (bool) – Whether to print a summary of the evaluation results. Defaults to True.
timeout (Optional[int]) – The timeout for the evaluation execution in seconds. Defaults to 60.
rate_limit_errors (Optional[RateLimitErrors]) – An exception or sequence of exceptions to adaptively throttle on. Defaults to None.
retries (int) – The number of times to retry a task if it fails. Defaults to 3.

Returns:

A dictionary containing the evaluation results, in the same format as run_experiment.

Return type:

RanExperiment

Raises:

ValueError – If no evaluators are provided or experiment has no runs.
httpx.HTTPStatusError – If the API returns an error response.
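To make the interplay of `retries` and `rate_limit_errors` concrete, here is a simplified sketch. It assumes that `retries` counts attempts after the first failure and that throttling means backing off with a doubling delay; the client's actual retry and throttling strategy is not described in this reference and may differ. `call_with_retries` and `FakeRateLimitError` are hypothetical names.

```python
import time
from typing import Any, Callable, List, Tuple, Type


# Hypothetical helper for illustration; not part of the Phoenix client API.
def call_with_retries(
    fn: Callable[[], Any],
    *,
    retries: int = 3,
    rate_limit_errors: Tuple[Type[BaseException], ...] = (),
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call `fn`, retrying up to `retries` times after the first failure."""
    delay = base_delay
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception as exc:
            if attempt == retries:
                raise  # out of retries: surface the last error
            if isinstance(exc, rate_limit_errors):
                sleep(delay)  # back off only when the provider is throttling
                delay *= 2


class FakeRateLimitError(Exception):
    """Stand-in for a provider's rate-limit exception."""


attempts: List[int] = []
waits: List[float] = []


def flaky_evaluator_call() -> float:
    attempts.append(1)
    if len(attempts) < 3:
        raise FakeRateLimitError("slow down")
    return 1.0


score = call_with_retries(
    flaky_evaluator_call,
    rate_limit_errors=(FakeRateLimitError,),
    sleep=waits.append,  # record the waits instead of actually sleeping
)
```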

get(*, experiment_id)

Get an experiment by ID.

Parameters:

experiment_id (str) – The ID of the experiment to retrieve.

Returns:

The experiment with the specified ID.

Return type:

Experiment

Raises:

httpx.HTTPError – If the request fails.
ValueError – If the experiment is not found.

Example:

from phoenix.client import Client
client = Client()

experiment = client.experiments.get(experiment_id="exp_123")
print(f"Example count: {experiment['example_count']}")
print(f"Successful runs: {experiment['successful_run_count']}")

get_dataset_experiments_url(dataset_id)

get_experiment(*, experiment_id)

Get a completed experiment by ID.

This method retrieves a completed experiment with all its task runs and evaluation runs, returning a RanExperiment object that can be used with evaluate_experiment to run additional evaluations.

Parameters:

experiment_id (str) – The ID of the experiment to retrieve.

Returns:

A RanExperiment object containing the experiment data, task runs, and evaluation runs.

Return type:

RanExperiment

Raises:

ValueError – If the experiment is not found.
httpx.HTTPStatusError – If the API returns an error response.

Examples:

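from phoenix.client import Client
client = Client()

# `correctness` is a user-defined evaluator function (not shown here)
experiment = client.experiments.get_experiment(experiment_id="123")
client.experiments.evaluate_experiment(
    experiment=experiment,
    evaluators=[
        correctness,
    ],
    print_summary=True,
)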

get_experiment_url(dataset_id, experiment_id)

list(*, dataset_id, timeout=DEFAULT_TIMEOUT_IN_SECONDS)

List all experiments for a dataset with automatic pagination handling.

This method automatically handles pagination behind the scenes and returns a simple list of experiments.

Parameters:

dataset_id (str) – The ID of the dataset to list experiments for.
timeout – Request timeout in seconds for each paginated request. Defaults to 60.

Returns:

A list of all experiments for the dataset.

Return type:

list[Experiment]

Raises:

httpx.HTTPError – If the request fails.

Example:

from phoenix.client import Client
client = Client()

experiments = client.experiments.list(dataset_id="dataset_123")
for experiment in experiments:
    print(f"Experiment: {experiment['id']}, Runs: {experiment['successful_run_count']}")
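The pagination that list hides can be pictured as a cursor loop. The sketch below is an illustration under assumptions: `collect_all_pages` is a hypothetical helper, and the page shape (a `data` list plus a `next_cursor` that is empty on the last page) is assumed for the example rather than taken from this reference.

```python
from typing import Any, Callable, Dict, List, Optional


# Hypothetical helper for illustration; not part of the Phoenix client API.
def collect_all_pages(fetch_page: Callable[[Optional[str]], Dict[str, Any]]) -> List[Any]:
    """Follow cursors until the server stops returning one."""
    items: List[Any] = []
    cursor: Optional[str] = None
    while True:
        page = fetch_page(cursor)
        items.extend(page["data"])
        cursor = page.get("next_cursor")
        if not cursor:
            return items


# Fake two-page response keyed by cursor, standing in for HTTP requests.
FAKE_PAGES: Dict[Optional[str], Dict[str, Any]] = {
    None: {"data": [{"id": "exp_1"}, {"id": "exp_2"}], "next_cursor": "c1"},
    "c1": {"data": [{"id": "exp_3"}], "next_cursor": None},
}

experiments = collect_all_pages(FAKE_PAGES.__getitem__)
experiment_ids = [experiment["id"] for experiment in experiments]
```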

log_evaluation(*, experiment_run_id, name, annotator_kind='CODE', start_time=None, end_time=None, score=None, label=None, explanation=None, error=None, metadata=None, trace_id=None, timeout=DEFAULT_TIMEOUT_IN_SECONDS)

Post (upsert) a single evaluation for an experiment run.

The server upserts on (experiment_run_id, name), so re-posting the same name replaces the prior annotation. At least one of score/label/explanation (i.e. a result) or error must be provided.

Parameters:

experiment_run_id (str) – The ID of the run being evaluated.
name (str) – The annotation name. One run carries many annotations keyed by name.
annotator_kind (str) – One of “CODE”, “LLM”, “HUMAN”. Defaults to “CODE”.
start_time (Optional[datetime]) – Evaluation start. Defaults to now.
end_time (Optional[datetime]) – Evaluation end. Defaults to start_time.
score (Optional[float]) – Numeric score.
label (Optional[str]) – Categorical label.
explanation (Optional[str]) – Free-text explanation.
error (Optional[str]) – Error repr if the evaluation failed.
metadata (Optional[Mapping[str, Any]]) – Extra metadata for the annotation.
trace_id (Optional[str]) – Trace ID correlating the evaluation to its spans.
timeout (Optional[int]) – Request timeout in seconds. Defaults to 60.

Returns:

The server response, carrying the id of the upserted evaluation (the only field the server returns).

Return type:

UpsertExperimentEvaluationResponseBodyData

Raises:

ValueError – If neither a result (score/label/explanation) nor an error is provided.
httpx.HTTPStatusError – On API errors.
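The two rules described for log_evaluation (a result or an error is required, and re-posting the same name replaces the earlier annotation) can be modelled locally. The sketch below uses an in-memory dict as a stand-in for the server; `upsert_evaluation` and `evaluations` are hypothetical names, and the stored record shape is an assumption.

```python
from typing import Any, Dict, Optional, Tuple

# In-memory stand-in for the server-side table; illustration only.
evaluations: Dict[Tuple[str, str], Dict[str, Any]] = {}


# Hypothetical helper for illustration; not part of the Phoenix client API.
def upsert_evaluation(
    experiment_run_id: str,
    name: str,
    *,
    score: Optional[float] = None,
    label: Optional[str] = None,
    explanation: Optional[str] = None,
    error: Optional[str] = None,
) -> Dict[str, Any]:
    has_result = any(value is not None for value in (score, label, explanation))
    if not has_result and error is None:
        raise ValueError("provide score, label, or explanation, or else an error")
    record = {"score": score, "label": label, "explanation": explanation, "error": error}
    # Same (run, name) key: the new record replaces the old one.
    evaluations[(experiment_run_id, name)] = record
    return record


upsert_evaluation("run_1", "accuracy", score=0.0, label="fail")
upsert_evaluation("run_1", "accuracy", score=1.0, label="pass")  # replaces the first
upsert_evaluation("run_1", "tone", label="polite")  # a different name adds a new entry
```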

log_run(*, experiment_id, dataset_example_id, output, start_time, end_time, repetition_number=1, trace_id=None, error=None, timeout=DEFAULT_TIMEOUT_IN_SECONDS)

Post a single experiment run and return the server record.

This is the public, single-run counterpart to run_experiment’s batch loop. It is the primitive the pytest plugin and any incremental-posting consumer build on.

Parameters:

experiment_id (str) – The ID of the experiment the run belongs to.
dataset_example_id (str) – The node ID (GlobalID) of the dataset example the run executed on.
output – The JSON-serializable task output. Ignored by the server when error is set, but still recorded.
start_time (datetime) – When the run started.
end_time (datetime) – When the run finished.
repetition_number (int) – 1-based repetition index for this example. Defaults to 1.
trace_id (Optional[str]) – Trace ID correlating the run to its spans. Defaults to None.
error (Optional[str]) – Error repr if the run failed. A run stored with an error is re-runnable (the server upserts it); a successful run is immutable. Defaults to None.
timeout (Optional[int]) – Request timeout in seconds. Defaults to 60.

Returns:

The run record, with id set to the server-assigned ID on success.

Return type:

ExperimentRun

Raises:

httpx.HTTPStatusError – On API errors, including a 409 when a successful (immutable) run already exists for this (experiment, example, repetition). Callers that expect duplicates (e.g. concurrent runners) should catch the 409 and decide what to do with the already-recorded run; there is no nothing-to-resolve placeholder.
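One way a caller might handle the 409 described above is to treat it as "already recorded" and move on. The sketch below is duck-typed so that it runs without a server: it only relies on the raised exception exposing `response.status_code`, which `httpx.HTTPStatusError` does. `log_run_once` and the fake classes are hypothetical, and returning `None` for a duplicate is just one possible policy.

```python
from typing import Any, Callable, Optional


# Hypothetical helper for illustration; not part of the Phoenix client API.
def log_run_once(post_run: Callable[[], Any]) -> Optional[Any]:
    """Call `post_run`; return None instead of raising on HTTP 409."""
    try:
        return post_run()
    except Exception as exc:
        response = getattr(exc, "response", None)
        if getattr(response, "status_code", None) == 409:
            return None  # a successful run already exists for this slot
        raise


class FakeResponse:
    def __init__(self, status_code: int) -> None:
        self.status_code = status_code


class FakeHTTPStatusError(Exception):
    """Stand-in exposing the same `response.status_code` attribute path."""

    def __init__(self, status_code: int) -> None:
        super().__init__(f"HTTP {status_code}")
        self.response = FakeResponse(status_code)


def post_duplicate() -> Any:
    raise FakeHTTPStatusError(409)


def post_new() -> Any:
    return {"id": "run_1"}


duplicate_result = log_run_once(post_duplicate)
new_result = log_run_once(post_new)
```

With a real client, `post_run` would typically be a lambda wrapping `client.experiments.log_run(...)`.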

resume_evaluation(*, experiment_id, evaluators, print_summary=True, timeout=DEFAULT_TIMEOUT_IN_SECONDS, rate_limit_errors=None, retries=3)

Resume incomplete evaluations for an experiment.

This method identifies which evaluations have not been completed (either missing or failed) and runs the evaluators only for those runs. This is useful for:

Recovering from transient evaluator failures
Adding new evaluators to completed experiments
Completing partially evaluated experiments

The method processes incomplete evaluations in batches using pagination to minimize memory usage.

Evaluation names are matched to evaluator dict keys. For example, if you pass {"accuracy": accuracy_fn}, it will check for and resume any runs missing the “accuracy” evaluation.

Note

Multi-output evaluators (evaluators that return a list/sequence of results) are not supported for resume operations. Each evaluator should produce a single evaluation result with a name matching the evaluator’s key in the dictionary.

Parameters:

experiment_id (str) – The ID of the experiment to resume evaluations for.
evaluators (ExperimentEvaluators) – A single evaluator or sequence of evaluators to run. Evaluators can be provided as a dict mapping names to functions, or as a list of functions (names will be auto-generated).
print_summary (bool) – Whether to print a summary of evaluation results. Defaults to True.
timeout (Optional[int]) – The timeout for evaluation execution in seconds. Defaults to 60.
rate_limit_errors (Optional[RateLimitErrors]) – An exception or sequence of exceptions to adaptively throttle on. Defaults to None.
retries (int) – The number of times to retry a task if it fails. Defaults to 3.

Raises:

ValueError – If the experiment is not found or no evaluators are provided.
httpx.HTTPStatusError – If the API returns an error response.

Example:

from phoenix.client import Client
client = Client()

def accuracy(output, expected):
    return 1.0 if output == expected else 0.0

# Standard usage: evaluation name matches evaluator key
client.experiments.resume_evaluation(
    experiment_id="exp_123",
    evaluators={"accuracy": accuracy},
)
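The name matching described for resume_evaluation can be sketched as a set difference per run. This is a simplified local model, not the client's implementation: `find_missing_evaluations` is a hypothetical helper, and the run records are plain dicts with an assumed shape.

```python
from typing import Any, Dict, Iterable, List, Tuple


# Hypothetical helper for illustration; not part of the Phoenix client API.
def find_missing_evaluations(
    runs: Iterable[Dict[str, Any]],
    evaluator_names: Iterable[str],
) -> List[Tuple[str, str]]:
    """Return (run_id, evaluator_name) pairs that still need to be evaluated."""
    names = list(evaluator_names)
    missing: List[Tuple[str, str]] = []
    for run in runs:
        # An evaluation counts as done only if it was recorded without an error.
        done = {e["name"] for e in run["evaluations"] if e.get("error") is None}
        missing.extend((run["id"], name) for name in names if name not in done)
    return missing


runs = [
    {"id": "run_1", "evaluations": [{"name": "accuracy", "error": None}]},
    {"id": "run_2", "evaluations": [{"name": "accuracy", "error": "TimeoutError()"}]},
    {"id": "run_3", "evaluations": []},
]

to_resume = find_missing_evaluations(runs, ["accuracy", "tone"])
```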

resume_experiment(*, experiment_id, task, evaluators=None, print_summary=True, timeout=DEFAULT_TIMEOUT_IN_SECONDS, rate_limit_errors=None, retries=3)

Resume an incomplete experiment by running only the missing or failed runs.

This method identifies which (example, repetition) pairs have not been completed (either missing or failed) and re-runs the task only for those pairs. Optionally, evaluators can be run on the completed runs after task execution.

The method processes incomplete runs in batches using pagination to minimize memory usage.

Note

Multi-output evaluators (evaluators that return a list/sequence of results) are not supported for resume operations. Each evaluator should produce a single evaluation result with a name matching the evaluator’s key in the dictionary.

Parameters:

experiment_id (str) – The ID of the experiment to resume.
task (ExperimentTask) – The task to run on incomplete examples.
evaluators (Optional[ExperimentEvaluators]) – Optional evaluators to run on completed task runs. Evaluators can be provided as a dict mapping names to functions, or as a list of functions (names will be auto-generated). Defaults to None.
print_summary (bool) – Whether to print a summary of the results. Defaults to True.
timeout (Optional[int]) – The timeout for task execution in seconds. Defaults to 60.
rate_limit_errors (Optional[RateLimitErrors]) – An exception or sequence of exceptions to adaptively throttle on. Defaults to None.
retries (int) – The number of times to retry a task if it fails. Defaults to 3.

Returns:

None

Raises:

ValueError – If the experiment is not found.
httpx.HTTPStatusError – If the API returns an error response.

Example:

from phoenix.client import Client
client = Client()

# Resume an interrupted experiment
client.experiments.resume_experiment(
    experiment_id="exp_123",
    task=my_task,
)

# Resume with evaluators
client.experiments.resume_experiment(
    experiment_id="exp_123",
    task=my_task,
    evaluators={"quality": my_evaluator},
)
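To illustrate what "missing or failed (example, repetition) pairs" means for resume_experiment, the sketch below enumerates every expected pair and subtracts the successful ones. `incomplete_pairs` is a hypothetical helper; repetition numbers are taken to be 1-based, as documented for log_run.

```python
from itertools import product
from typing import List, Sequence, Set, Tuple


# Hypothetical helper for illustration; not part of the Phoenix client API.
def incomplete_pairs(
    example_ids: Sequence[str],
    repetitions: int,
    successful: Set[Tuple[str, int]],
) -> List[Tuple[str, int]]:
    """List (example_id, repetition_number) pairs with no successful run yet."""
    expected = product(example_ids, range(1, repetitions + 1))
    return [pair for pair in expected if pair not in successful]


# Runs that failed are simply absent from the set of successful pairs.
successful_runs = {("ex_1", 1), ("ex_1", 2), ("ex_2", 1)}

pending = incomplete_pairs(["ex_1", "ex_2"], repetitions=3, successful=successful_runs)
```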

run_experiment(*, dataset, task, evaluators=None, experiment_name=None, experiment_description=None, experiment_metadata=None, rate_limit_errors=None, dry_run=False, print_summary=True, timeout=DEFAULT_TIMEOUT_IN_SECONDS, repetitions=1, retries=3)

Runs an experiment using a given dataset of examples.

An experiment is a user-defined task that runs on each example in a dataset. The results from each experiment can be evaluated using any number of evaluators to measure the behavior of the task. The experiment and evaluation results are stored in the Phoenix database for comparison and analysis.

A task is a synchronous function that returns a JSON serializable output. If the task is a function of one argument then that argument will be bound to the input field of the dataset example. Alternatively, the task can be a function of any combination of specific argument names that will be bound to special values:

input: The input field of the dataset example
expected: The expected or reference output of the dataset example
reference: An alias for expected
metadata: Metadata associated with the dataset example
example: The dataset Example object with all associated fields

An evaluator is a synchronous function that returns an evaluation result object, which can take any of the following forms:

an EvaluationResult dict with optional fields for score, label, explanation and metadata
a bool, which will be interpreted as a score of 0 or 1 plus a label of “True” or “False”
a float, which will be interpreted as a score
a str, which will be interpreted as a label
a 2-tuple of (float, str), which will be interpreted as (score, explanation)

If the evaluator is a function of one argument then that argument will be bound to the output of the task. Alternatively, the evaluator can be a function of any combination of specific argument names that will be bound to special values:

input: The input field of the dataset example
output: The output of the task
expected: The expected or reference output of the dataset example
reference: An alias for expected
metadata: Metadata associated with the dataset example
example: The dataset Example object with all associated fields

Parameters:

dataset (Dataset) – The dataset on which to run the experiment.
task (ExperimentTask) – The task to run on each example in the dataset.
evaluators (Optional[ExperimentEvaluators]) – A single evaluator or sequence of evaluators used to evaluate the results of the experiment. Evaluators can be provided as a dict mapping names to functions, or as a list of functions (names will be auto-generated). Defaults to None.
experiment_name (Optional[str]) – The name of the experiment. Defaults to None.
experiment_description (Optional[str]) – A description of the experiment. Defaults to None.
experiment_metadata (Optional[Mapping[str, Any]]) – Metadata to associate with the experiment. Defaults to None.
rate_limit_errors (Optional[RateLimitErrors]) – An exception or sequence of exceptions to adaptively throttle on. Defaults to None.
dry_run (Union[bool, int]) – Run the experiment in dry-run mode. When set, experiment results will not be recorded in Phoenix. If True, the experiment will run on a random dataset example. If an integer, the experiment will run on a random sample of the dataset examples of the given size. Defaults to False.
print_summary (bool) – Whether to print a summary of the experiment and evaluation results. Defaults to True.
timeout (Optional[int]) – The timeout for the task execution in seconds. Use this to run longer tasks to avoid re-queuing the same task multiple times. Defaults to 60.
repetitions (int) – The number of times the task will be run on each example. Defaults to 1.
retries (int) – The number of times to retry a task if it fails. Defaults to 3.

Returns:

A dictionary containing the experiment results.

Return type:

RanExperiment

Raises:

ValueError – If dataset format is invalid or has no examples.
httpx.HTTPStatusError – If the API returns an error response.
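The dry_run parameter accepts either a bool or an int, and the difference is easy to get wrong because `bool` is a subclass of `int` in Python. The sketch below shows one way to select examples under the documented rules. `select_examples` is a hypothetical helper, and capping the sample at the dataset size is an assumption of this sketch.

```python
import random
from typing import List, Sequence, TypeVar, Union

T = TypeVar("T")


# Hypothetical helper for illustration; not part of the Phoenix client API.
def select_examples(
    examples: Sequence[T],
    dry_run: Union[bool, int],
    rng: random.Random,
) -> List[T]:
    """Pick the examples an experiment would run on for a given dry_run value."""
    if not dry_run:
        return list(examples)  # normal run: every example
    if dry_run is True:
        return [rng.choice(examples)]  # True: a single random example
    size = min(dry_run, len(examples))  # int: a random sample of that size
    return rng.sample(list(examples), size)


examples = ["ex_1", "ex_2", "ex_3", "ex_4", "ex_5"]
rng = random.Random(0)  # seeded so that repeated runs pick the same examples

full_run = select_examples(examples, False, rng)
single = select_examples(examples, True, rng)
sample_of_three = select_examples(examples, 3, rng)
```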

class AsyncExperiments(client, *, _guard=None)

Bases: object

Provides async methods for running experiments and evaluations.

An experiment is a user-defined task that runs on each example in a dataset. The results from each experiment can be evaluated using any number of evaluators to measure the behavior of the task. The experiment and evaluation results are stored in the Phoenix database for comparison and analysis.

A task is either a synchronous or asynchronous function that returns a JSON serializable output. If the task is a function of one argument then that argument will be bound to the input field of the dataset example. Alternatively, the task can be a function of any combination of specific argument names that will be bound to special values:

input: The input field of the dataset example
expected: The expected or reference output of the dataset example
reference: An alias for expected
metadata: Metadata associated with the dataset example
example: The dataset Example object with all associated fields

An evaluator is either a synchronous or asynchronous function that returns an evaluation result object, which can take any of the following forms:

a phoenix.experiments.types.EvaluationResult with optional fields for score, label, explanation and metadata
a bool, which will be interpreted as a score of 0 or 1 plus a label of “True” or “False”
a float, which will be interpreted as a score
a str, which will be interpreted as a label
a 2-tuple of (float, str), which will be interpreted as (score, explanation)

If the evaluator is a function of one argument then that argument will be bound to the output of the task. Alternatively, the evaluator can be a function of any combination of specific argument names that will be bound to special values:

input: The input field of the dataset example
output: The output of the task
expected: The expected or reference output of the dataset example
reference: An alias for expected
metadata: Metadata associated with the dataset example
example: The dataset Example object with all associated fields

Example

Basic usage:

from phoenix.client import AsyncClient
client = AsyncClient()
dataset = await client.datasets.get_dataset(dataset="my-dataset")

async def my_task(input):
    return f"Hello {input['name']}"

experiment = await client.experiments.run_experiment(
    dataset=dataset,
    task=my_task,
    experiment_name="greeting-experiment"
)

With evaluators:

async def accuracy_evaluator(output, expected):
    return 1.0 if output == expected['text'] else 0.0

experiment = await client.experiments.run_experiment(
    dataset=dataset,
    task=my_task,
    evaluators=[accuracy_evaluator],
    experiment_name="evaluated-experiment"
)

Using dynamic binding for tasks:

async def my_task(input, metadata, expected):
    # Task can access multiple fields from the dataset example
    context = metadata.get("context", "")
    return f"Context: {context}, Input: {input}, Expected: {expected}"

Using dynamic binding for evaluators:

async def my_evaluator(output, input, expected, metadata):
    # Evaluator can access task output and example fields
    # calculate_similarity is a placeholder for a user-supplied async scoring function
    score = await calculate_similarity(output, expected)
    return {"score": score, "label": "pass" if score > 0.8 else "fail"}

async create(*, dataset_id, dataset_version_id=None, experiment_name=None, experiment_description=None, experiment_metadata=None, splits=None, repetitions=1, timeout=DEFAULT_TIMEOUT_IN_SECONDS)

Create a new experiment without running it (async version).

This method creates an experiment record in the Phoenix database but does not execute any tasks. Use resume_experiment to run tasks on the created experiment.

Parameters:

dataset_id (str) – The ID of the dataset on which the experiment will be run.
dataset_version_id (Optional[str]) – The ID of the dataset version to use. If not provided, the latest version will be used. Defaults to None.
experiment_name (Optional[str]) – The name of the experiment. Defaults to None.
experiment_description (Optional[str]) – A description of the experiment. Defaults to None.
experiment_metadata (Optional[Mapping[str, Any]]) – Metadata to associate with the experiment. Defaults to None.
splits (Optional[Sequence[str]]) – List of dataset split identifiers (IDs or names) to filter by. Defaults to None.
repetitions (int) – The number of times the task will be run on each example. Defaults to 1.
timeout (Optional[int]) – The timeout for the request in seconds. Defaults to 60.

Returns:

The newly created experiment.

Return type:

Experiment

Raises:

httpx.HTTPStatusError – If the API returns an error response.

Example:

from phoenix.client import AsyncClient
async_client = AsyncClient()

experiment = await async_client.experiments.create(
    dataset_id="dataset_123",
    experiment_name="my-experiment",
    experiment_description="Testing my task",
    repetitions=3,
)
print(f"Created experiment with ID: {experiment['id']}")

await async_client.experiments.resume_experiment(
    experiment_id=experiment["id"],
    task=my_task,
)

async delete(*, experiment_id, delete_project=False)

Delete an experiment by ID.

Parameters:

experiment_id (str) – The ID of the experiment to delete.
delete_project (bool) – If True, also delete the project associated with the experiment. Defaults to False.

Raises:

httpx.HTTPError – If the request fails.
ValueError – If the experiment is not found.

Example:

from phoenix.client import AsyncClient
async_client = AsyncClient()

await async_client.experiments.delete(experiment_id="exp_123")

async evaluate_experiment(*, experiment, evaluators, dry_run=False, print_summary=True, timeout=DEFAULT_TIMEOUT_IN_SECONDS, concurrency=3, rate_limit_errors=None, retries=3)

Run evaluators on a completed experiment.

An evaluator is either a synchronous or asynchronous function that returns an evaluation result object, which can take any of the following forms:

an EvaluationResult dict with optional fields for score, label, explanation and metadata
a bool, which will be interpreted as a score of 0 or 1 plus a label of “True” or “False”
a float, which will be interpreted as a score
a str, which will be interpreted as a label
a 2-tuple of (float, str), which will be interpreted as (score, explanation)

Parameters:

experiment (RanExperiment) – The experiment to evaluate, returned from run_experiment.
evaluators (ExperimentEvaluators) – A single evaluator or sequence of evaluators used to evaluate the results of the experiment. Evaluators can be provided as a dict mapping names to functions, or as a list of functions (names will be auto-generated).
dry_run (bool) – Run the evaluation in dry-run mode. When set, evaluation results will not be recorded in Phoenix. Defaults to False.
print_summary (bool) – Whether to print a summary of the evaluation results. Defaults to True.
timeout (Optional[int]) – The timeout for the evaluation execution in seconds. Defaults to 60.
concurrency (int) – Specifies the concurrency for evaluation execution. Defaults to 3.
rate_limit_errors (Optional[RateLimitErrors]) – An exception or sequence of exceptions to adaptively throttle on. Defaults to None.
retries (int) – The number of times to retry a task if it fails. Defaults to 3.

Returns:

A dictionary containing the evaluation results, in the same format as run_experiment.

Return type:

RanExperiment

Raises:

ValueError – If no evaluators are provided or experiment has no runs.
httpx.HTTPStatusError – If the API returns an error response.
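The concurrency parameter, together with support for both synchronous and asynchronous evaluators, can be pictured with an `asyncio.Semaphore`. This is a minimal sketch of the general technique, not the client's scheduler: `evaluate_all` is a hypothetical helper, and it calls synchronous evaluators directly on the event loop, which a production implementation might avoid.

```python
import asyncio
import inspect
from typing import Any, Callable, Dict, List, Sequence


# Hypothetical helper for illustration; not part of the Phoenix client API.
async def evaluate_all(
    evaluator: Callable[[Any], Any],
    outputs: Sequence[Any],
    *,
    concurrency: int = 3,
) -> List[Any]:
    """Run `evaluator` on every output with at most `concurrency` in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def evaluate_one(output: Any) -> Any:
        async with semaphore:
            result = evaluator(output)
            if inspect.isawaitable(result):
                result = await result  # async evaluator: await its coroutine
            return result

    # gather preserves the order of `outputs` in its results.
    return await asyncio.gather(*(evaluate_one(o) for o in outputs))


stats: Dict[str, int] = {"in_flight": 0, "peak": 0}


async def length_score(output: str) -> float:
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    await asyncio.sleep(0.01)  # simulate a slow evaluation
    stats["in_flight"] -= 1
    return float(len(output))


async_scores = asyncio.run(
    evaluate_all(length_score, ["a", "bb", "ccc", "dddd", "eeeee"], concurrency=2)
)
sync_scores = asyncio.run(evaluate_all(len, ["a", "bb"]))
```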

async get(*, experiment_id)

Get an experiment by ID.

Parameters:

experiment_id (str) – The ID of the experiment to retrieve.

Returns:

The experiment with the specified ID.

Return type:

Experiment

Raises:

httpx.HTTPError – If the request fails.
ValueError – If the experiment is not found.

Example:

from phoenix.client import AsyncClient
async_client = AsyncClient()

experiment = await async_client.experiments.get(experiment_id="exp_123")
print(f"Example count: {experiment['example_count']}")
print(f"Successful runs: {experiment['successful_run_count']}")

get_dataset_experiments_url(dataset_id)

async get_experiment(*, experiment_id)

Get a completed experiment by ID (async version).

This method retrieves a completed experiment with all its task runs and evaluation runs, returning a RanExperiment object that can be used with evaluate_experiment to run additional evaluations.

Parameters:

experiment_id (str) – The ID of the experiment to retrieve.

Returns:

A RanExperiment object containing the experiment data, task runs, and evaluation runs.

Return type:

RanExperiment

Raises:

ValueError – If the experiment is not found.
httpx.HTTPStatusError – If the API returns an error response.

Examples:

from phoenix.client import AsyncClient
client = AsyncClient()
experiment = await client.experiments.get_experiment(experiment_id="123")
await client.experiments.evaluate_experiment(
    experiment=experiment,
    evaluators=[
        correctness,
    ],
    print_summary=True,
)

get_experiment_url(dataset_id, experiment_id)

async list(*, dataset_id, timeout=DEFAULT_TIMEOUT_IN_SECONDS)

List all experiments for a dataset with automatic pagination handling.

This method automatically handles pagination behind the scenes and returns a simple list of experiments.

Parameters:

dataset_id (str) – The ID of the dataset to list experiments for.
timeout – Request timeout in seconds for each paginated request. Defaults to 60.

Returns:

A list of all experiments for the dataset.

Return type:

list[Experiment]

Raises:

httpx.HTTPError – If the request fails.

Example:

from phoenix.client import AsyncClient
async_client = AsyncClient()

experiments = await async_client.experiments.list(dataset_id="dataset_123")
for experiment in experiments:
    print(f"Experiment: {experiment['id']}, Runs: {experiment['successful_run_count']}")

async log_evaluation(*, experiment_run_id, name, annotator_kind='CODE', start_time=None, end_time=None, score=None, label=None, explanation=None, error=None, metadata=None, trace_id=None, timeout=DEFAULT_TIMEOUT_IN_SECONDS)

Post (upsert) a single evaluation for an experiment run.

Async counterpart to Experiments.log_evaluation; see that method for full semantics.

async log_run(*, experiment_id, dataset_example_id, output, start_time, end_time, repetition_number=1, trace_id=None, error=None, timeout=DEFAULT_TIMEOUT_IN_SECONDS)

Post a single experiment run and return the server record.

Async counterpart to Experiments.log_run; see that method for full semantics.

async resume_evaluation(*, experiment_id, evaluators, print_summary=True, timeout=DEFAULT_TIMEOUT_IN_SECONDS, concurrency=3, rate_limit_errors=None, retries=3)

Resume incomplete evaluations for an experiment (async version).

This method identifies which evaluations have not been completed (either missing or failed) and runs the evaluators only for those runs. This is useful for:

Recovering from transient evaluator failures
Adding new evaluators to completed experiments
Completing partially evaluated experiments

The method processes incomplete evaluations in batches using pagination to minimize memory usage.

Evaluation names are matched to evaluator dict keys. For example, if you pass {"accuracy": accuracy_fn}, it will check for and resume any runs missing the “accuracy” evaluation.

Note

Multi-output evaluators (evaluators that return a list/sequence of results) are not supported for resume operations. Each evaluator should produce a single evaluation result with a name matching the evaluator’s key in the dictionary.

Parameters:

experiment_id (str) – The ID of the experiment to resume evaluations for.
evaluators (ExperimentEvaluators) – A single evaluator or sequence of evaluators to run. Evaluators can be provided as a dict mapping names to functions, or as a list of functions (names will be auto-generated).
print_summary (bool) – Whether to print a summary of evaluation results. Defaults to True.
timeout (Optional[int]) – The timeout for evaluation execution in seconds. Defaults to 60.
concurrency (int) – The number of concurrent evaluations to run. Defaults to 3.
rate_limit_errors (Optional[RateLimitErrors]) – An exception or sequence of exceptions to adaptively throttle on. Defaults to None.
retries (int) – The number of times to retry an evaluation if it fails. Defaults to 3.

Raises:

ValueError – If the experiment is not found or no evaluators are provided.
httpx.HTTPStatusError – If the API returns an error response.

Example:

from phoenix.client import AsyncClient
client = AsyncClient()

async def accuracy(output, expected):
    return 1.0 if output == expected else 0.0

# Standard usage: evaluation name matches evaluator key
await client.experiments.resume_evaluation(
    experiment_id="exp_123",
    evaluators={"accuracy": accuracy},
)
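
The concurrency and retries parameters describe bounded parallelism with retry on failure. The following is a minimal, self-contained sketch of that general pattern using asyncio. It is not Phoenix's implementation: run_all, run_with_retries, and flaky_evaluator are hypothetical names, and the sketch assumes that retries counts the additional attempts made after the first failure.

```python
import asyncio


async def run_with_retries(fn, item, retries):
    """Call fn(item); on failure, try again up to `retries` more times."""
    for attempt in range(retries + 1):
        try:
            return await fn(item)
        except Exception:
            if attempt == retries:
                raise  # out of attempts: surface the last error


async def run_all(fn, items, concurrency=3, retries=3):
    """Run fn over items with at most `concurrency` calls in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def worker(item):
        async with semaphore:
            return await run_with_retries(fn, item, retries)

    # gather preserves the order of `items` in its result list
    return await asyncio.gather(*(worker(item) for item in items))


# Hypothetical flaky evaluator: fails once per item, then succeeds.
seen = set()


async def flaky_evaluator(item):
    if item not in seen:
        seen.add(item)
        raise RuntimeError("transient failure")
    return item * 2


results = asyncio.run(run_all(flaky_evaluator, [1, 2, 3, 4]))
```

Each item fails on its first attempt and succeeds on the retry, so every result is eventually produced while no more than three calls run at once.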

async resume_experiment(*, experiment_id, task, evaluators=None, print_summary=True, timeout=DEFAULT_TIMEOUT_IN_SECONDS, concurrency=3, rate_limit_errors=None, retries=3)#

Resume an incomplete experiment by running only the missing or failed runs.

This method identifies which (example, repetition) pairs have not been completed (either missing or failed) and re-runs the task only for those pairs. Optionally, evaluators can be run on the completed runs after task execution.

The method processes incomplete runs in batches using pagination to minimize memory usage.

Note

Multi-output evaluators (evaluators that return a list/sequence of results) are not supported for resume operations. Each evaluator should produce a single evaluation result with a name matching the evaluator’s key in the dictionary.

Parameters:

experiment_id (str) – The ID of the experiment to resume.
task (ExperimentTask) – The task to run on incomplete examples.
evaluators (Optional[ExperimentEvaluators]) – Optional evaluators to run on completed task runs. Evaluators can be provided as a dict mapping names to functions, or as a list of functions (names will be auto-generated). Defaults to None.
print_summary (bool) – Whether to print a summary of the results. Defaults to True.
timeout (Optional[int]) – The timeout for task execution in seconds. Defaults to 60.
concurrency (int) – The number of concurrent tasks to run. Defaults to 3.
rate_limit_errors (Optional[RateLimitErrors]) – An exception or sequence of exceptions to adaptively throttle on. Defaults to None.
retries (int) – The number of times to retry a task if it fails. Defaults to 3.

Returns:

None

Raises:

ValueError – If the experiment is not found.
httpx.HTTPStatusError – If the API returns an error response.

Example:

from phoenix.client import AsyncClient
client = AsyncClient()

# Resume an interrupted experiment
await client.experiments.resume_experiment(
    experiment_id="exp_123",
    task=my_async_task,
)

# Resume with evaluators
await client.experiments.resume_experiment(
    experiment_id="exp_123",
    task=my_async_task,
    evaluators={"quality": my_evaluator},
)

async run_experiment(*, dataset, task, evaluators=None, experiment_name=None, experiment_description=None, experiment_metadata=None, rate_limit_errors=None, dry_run=False, print_summary=True, concurrency=3, timeout=DEFAULT_TIMEOUT_IN_SECONDS, repetitions=1, retries=3)#

Runs an experiment using a given dataset of examples (async version).

An experiment is a user-defined task that runs on each example in a dataset. The results from each experiment can be evaluated using any number of evaluators to measure the behavior of the task. The experiment and evaluation results are stored in the Phoenix database for comparison and analysis.

A task is either a synchronous or asynchronous function that returns a JSON serializable output. If the task is a function of one argument then that argument will be bound to the input field of the dataset example. Alternatively, the task can be a function of any combination of specific argument names that will be bound to special values:

input: The input field of the dataset example
expected: The expected or reference output of the dataset example
reference: An alias for expected
metadata: Metadata associated with the dataset example
example: The dataset Example object with all associated fields
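
The name-based binding described above can be illustrated with a small standalone sketch built on inspect.signature. This is an assumption about how such binding could work, not Phoenix's actual code: bind_task_arguments and the plain-dict example record are hypothetical, and the sketch assumes that a single argument with an unrecognized name receives the input field.

```python
import inspect

# Hypothetical example record; the keys mirror the special names listed above.
example = {
    "input": {"question": "What is 2 + 2?"},
    "expected": {"answer": "4"},
    "metadata": {"topic": "arithmetic"},
}


def bind_task_arguments(task, example):
    """Illustrative only: pick keyword arguments for `task` by parameter name."""
    available = {
        "input": example["input"],
        "expected": example["expected"],
        "reference": example["expected"],  # alias for expected
        "metadata": example["metadata"],
        "example": example,
    }
    names = list(inspect.signature(task).parameters)
    if len(names) == 1 and names[0] not in available:
        # Assumption: a lone argument with any other name gets the input field.
        return {names[0]: example["input"]}
    return {name: available[name] for name in names}


def one_arg_task(x):
    return x["question"].upper()


def named_task(input, reference, metadata):
    return {
        "question": input["question"],
        "topic": metadata["topic"],
        "ref": reference["answer"],
    }


one_arg_result = one_arg_task(**bind_task_arguments(one_arg_task, example))
named_result = named_task(**bind_task_arguments(named_task, example))
```

Here one_arg_task receives only the input field, while named_task receives the three values matching its parameter names.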

An evaluator is either a synchronous or asynchronous function that returns an evaluation result object, which can take any of the following forms:

an EvaluationResult dict with optional fields for score, label, explanation and metadata
a bool, which will be interpreted as a score of 0 or 1 plus a label of “True” or “False”
a float, which will be interpreted as a score
a str, which will be interpreted as a label
a 2-tuple of (float, str), which will be interpreted as (score, explanation)
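
To make the accepted return forms concrete, the sketch below maps each one to a dict of score, label, and explanation fields. It is a hedged illustration of the rules listed above, not the library's own normalization code; normalize_result is a hypothetical helper, and accepting plain integers as scores is an assumption of this sketch.

```python
def normalize_result(result):
    """Illustrative only: map an evaluator return value to result fields."""
    if isinstance(result, dict):
        # Already in EvaluationResult-like form; copy it unchanged.
        return dict(result)
    if isinstance(result, bool):
        # Check bool before numbers because bool is a subclass of int.
        return {"score": float(result), "label": str(result)}
    if isinstance(result, (int, float)):
        # Assumption: ints are accepted alongside floats as scores.
        return {"score": float(result)}
    if isinstance(result, str):
        return {"label": result}
    if isinstance(result, tuple) and len(result) == 2:
        score, explanation = result
        return {"score": float(score), "explanation": explanation}
    raise TypeError(f"unsupported evaluator result: {result!r}")


bool_form = normalize_result(True)
float_form = normalize_result(0.75)
str_form = normalize_result("correct")
tuple_form = normalize_result((0.5, "partially correct"))
```

The bool check has to come before the numeric check; otherwise True would be recorded as a bare score of 1.0 with no label.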

If the evaluator is a function of one argument then that argument will be bound to the output of the task. Alternatively, the evaluator can be a function of any combination of specific argument names that will be bound to special values:

input: The input field of the dataset example
output: The output of the task
expected: The expected or reference output of the dataset example
reference: An alias for expected
metadata: Metadata associated with the dataset example
example: The dataset Example object with all associated fields

Parameters:

dataset (Dataset) – The dataset on which to run the experiment.
task (ExperimentTask) – The task to run on each example in the dataset.
evaluators (Optional[ExperimentEvaluators]) – A single evaluator or sequence of evaluators used to evaluate the results of the experiment. Evaluators can be provided as a dict mapping names to functions, or as a list of functions (names will be auto-generated). Defaults to None.
experiment_name (Optional[str]) – The name of the experiment. Defaults to None.
experiment_description (Optional[str]) – A description of the experiment. Defaults to None.
experiment_metadata (Optional[Mapping[str, Any]]) – Metadata to associate with the experiment. Defaults to None.
rate_limit_errors (Optional[RateLimitErrors]) – An exception or sequence of exceptions to adaptively throttle on. Defaults to None.
dry_run (Union[bool, int]) – Run the experiment in dry-run mode. When set, experiment results will not be recorded in Phoenix. If True, the experiment will run on a random dataset example. If an integer, the experiment will run on a random sample of the dataset examples of the given size. Defaults to False.
print_summary (bool) – Whether to print a summary of the experiment and evaluation results. Defaults to True.
concurrency (int) – Specifies the concurrency for task execution. Defaults to 3.
timeout (Optional[int]) – The timeout for the task execution in seconds. Increase this for long-running tasks to avoid re-queuing the same task multiple times. Defaults to 60.
repetitions (int) – The number of times the task will be run on each example. Defaults to 1.
retries (int) – The number of times to retry a task if it fails. Defaults to 3.

Returns:

A dictionary containing the experiment results.

Return type:

RanExperiment

Raises:

ValueError – If the dataset format is invalid or the dataset has no examples.
httpx.HTTPStatusError – If the API returns an error response.
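
A minimal usage sketch for run_experiment follows, using only the parameters documented above. The task, the evaluators, and the experiment name are hypothetical, and the dataset is assumed to be a Dataset object obtained elsewhere. The Phoenix import sits inside main so the helper functions can be exercised on their own.

```python
import asyncio


def question_length_task(input):
    """One-argument task: receives the input field of the example."""
    return {"length": len(input["question"])}


def has_length(output):
    """One-argument evaluator: receives the task output and returns a bool."""
    return "length" in output


def matches_expected(output, expected):
    """Named-argument evaluator returning a (score, explanation) 2-tuple."""
    ok = output == expected
    return (1.0 if ok else 0.0, "exact match" if ok else "output differs from expected")


async def main(dataset):
    # Imported here so the functions above can be tested without Phoenix.
    from phoenix.client import AsyncClient

    client = AsyncClient()
    return await client.experiments.run_experiment(
        dataset=dataset,
        task=question_length_task,
        evaluators={"has_length": has_length, "matches_expected": matches_expected},
        experiment_name="length-check",  # hypothetical name
        dry_run=True,  # results are not recorded in Phoenix
    )


# asyncio.run(main(dataset))  # `dataset` is a Dataset obtained elsewhere
```

Passing the evaluators as a dict gives each evaluation an explicit name, which is the form the resume methods match against.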