Datasets
- class Datasets(client, *, _guard=None)
Bases: object
Client for managing dataset resources in Phoenix.
This class provides methods for listing, retrieving, creating, and updating datasets. Datasets are collections of input/output examples used for training, evaluation, and experimentation.
- Key Methods:
list(): Get all datasets with automatic pagination
get_dataset(): Retrieve a specific dataset with examples
create_dataset(): Create new datasets from various sources
add_examples_to_dataset(): Add examples to existing datasets
Examples
Basic usage:
    from phoenix.client import Client

    client = Client()

    # Get a dataset
    dataset = client.datasets.get_dataset(dataset="my-dataset")
    print(f"Dataset {dataset.name} has {len(dataset)} examples")
Listing datasets:
    # Get all datasets (automatically handles pagination)
    all_datasets = client.datasets.list()
    print(f"Found {len(all_datasets)} total datasets")

    # Get limited number of datasets
    limited_datasets = client.datasets.list(limit=10)
    print(f"Found {len(limited_datasets)} datasets (limited to 10)")
Creating and updating datasets:
    # Create a new dataset
    dataset = client.datasets.create_dataset(
        name="qa-dataset",
        inputs=[
            {"question": "What is 2+2?"},
            {"question": "What's the capital of France?"},
        ],
        outputs=[{"answer": "4"}, {"answer": "Paris"}],
    )

    # Add more examples later
    updated = client.datasets.add_examples_to_dataset(
        dataset="qa-dataset",
        inputs=[{"question": "Who wrote Hamlet?"}],
        outputs=[{"answer": "Shakespeare"}],
    )
Working with DataFrames:
    import pandas as pd

    # Convert dataset to DataFrame
    df = dataset.to_dataframe()
    print(df.columns)
    # Index(['input', 'output', 'metadata'], dtype='object')

    # Create dataset from DataFrame
    df = pd.DataFrame({
        "prompt": ["Hello", "Hi there"],
        "response": ["Hi!", "Hello!"],
        "score": [0.9, 0.95],
    })
    dataset = client.datasets.create_dataset(
        name="greetings",
        dataframe=df,
        input_keys=["prompt"],
        output_keys=["response"],
        metadata_keys=["score"],
    )
- add_examples_to_dataset(*, dataset, examples=None, dataframe=None, csv_file_path=None, input_keys=(), output_keys=(), metadata_keys=(), split_keys=(), split_key=None, span_id_key=None, example_id_key=None, inputs=(), outputs=(), metadata=(), timeout=DEFAULT_TIMEOUT_IN_SECONDS)
Append examples to an existing dataset on the Phoenix server.
- Parameters:
dataset – A dataset identifier: a dataset ID string, name string, Dataset object, or dict with ‘id’/‘name’ fields.
examples – Either a single dictionary with required ‘input’ and ‘output’ keys and an optional ‘metadata’ key, or an iterable of such dictionaries. When provided, inputs/outputs/metadata are extracted automatically.
dataframe – pandas DataFrame (requires pandas to be installed).
csv_file_path – Location of a CSV text file
input_keys – List of column names used as input keys.
output_keys – List of column names used as output keys.
metadata_keys – List of column names used as metadata keys.
split_keys – Deprecated. Use split_key instead. List of column names used for automatically assigning examples to splits.
split_key – Optional single column name used for assigning examples to splits. Mutually exclusive with split_keys.
span_id_key – Optional column name containing span IDs to link dataset examples back to their original traces. The column should contain OTEL span_id values (string format). Examples will be linked to spans if they exist in the database.
example_id_key – Optional column name containing user-provided IDs for examples. If not provided, the server will generate an ID for each example.
inputs – List of dictionaries each corresponding to an example.
outputs – List of dictionaries each corresponding to an example.
metadata – List of dictionaries each corresponding to an example.
timeout – Optional request timeout in seconds.
- Returns:
A Dataset object containing the updated dataset and examples.
- Raises:
ValueError – If invalid parameter combinations are provided.
ImportError – If pandas is required but not installed.
httpx.HTTPStatusError – If the API returns an error response.
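The examples parameter expects dictionaries with ‘input’ and ‘output’ keys and an optional ‘metadata’ key. The following is a minimal sketch of assembling such a list from parallel lists. `build_examples` is a hypothetical helper written for illustration, not part of the Phoenix client, and the final client call appears only as a comment because it needs a running server.

```python
from typing import Any, Mapping, Optional, Sequence


def build_examples(
    inputs: Sequence[Mapping[str, Any]],
    outputs: Sequence[Mapping[str, Any]],
    metadata: Optional[Sequence[Mapping[str, Any]]] = None,
) -> list[dict[str, Any]]:
    """Zip parallel lists into the documented example shape (hypothetical helper)."""
    if len(inputs) != len(outputs):
        raise ValueError("inputs and outputs must have the same length")
    if metadata is not None and len(metadata) != len(inputs):
        raise ValueError("metadata must have the same length as inputs")

    examples: list[dict[str, Any]] = []
    for i, (inp, out) in enumerate(zip(inputs, outputs)):
        example: dict[str, Any] = {"input": dict(inp), "output": dict(out)}
        # 'metadata' is optional, so only add it when the caller supplied it.
        if metadata is not None:
            example["metadata"] = dict(metadata[i])
        examples.append(example)
    return examples


examples = build_examples(
    inputs=[{"question": "Who wrote Hamlet?"}],
    outputs=[{"answer": "Shakespeare"}],
    metadata=[{"source": "manual"}],
)
print(examples)

# With a running Phoenix server, the list could then be passed as:
# client.datasets.add_examples_to_dataset(dataset="qa-dataset", examples=examples)
```

Checking lengths before the request keeps a mismatched example from being reported only after a round trip to the server.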
- create_dataset(*, name, examples=None, dataframe=None, csv_file_path=None, input_keys=(), output_keys=(), metadata_keys=(), split_keys=(), split_key=None, span_id_key=None, example_id_key=None, inputs=(), outputs=(), metadata=(), dataset_description=None, timeout=DEFAULT_TIMEOUT_IN_SECONDS)
Create a new dataset by uploading examples to the Phoenix server.
- Parameters:
name – Name of the dataset.
examples – Either a single dictionary with required ‘input’ and ‘output’ keys and an optional ‘metadata’ key, or an iterable of such dictionaries. When provided, inputs/outputs/metadata are extracted automatically.
dataframe – pandas DataFrame (requires pandas to be installed).
csv_file_path – Location of a CSV text file
input_keys – List of column names used as input keys.
output_keys – List of column names used as output keys.
metadata_keys – List of column names used as metadata keys.
split_keys – Deprecated. Use split_key instead. List of column names used for automatically assigning examples to splits.
split_key – Optional single column name used for assigning examples to splits. Mutually exclusive with split_keys.
span_id_key – Optional column name containing span IDs to link dataset examples back to their original traces. The column should contain OTEL span_id values (string format). Examples will be linked to spans if they exist in the database.
example_id_key – Optional column name containing stable IDs for examples. If not provided, the server will generate an ID for new examples.
inputs – List of dictionaries each corresponding to an example.
outputs – List of dictionaries each corresponding to an example.
metadata – List of dictionaries each corresponding to an example.
dataset_description – Description of the dataset.
timeout – Optional request timeout in seconds.
- Returns:
A Dataset object containing the uploaded dataset and examples.
- Raises:
ValueError – If invalid parameter combinations are provided.
ImportError – If pandas is required but not installed.
httpx.HTTPStatusError – If the API returns an error response.
Example:
    from phoenix.client import Client
    import pandas as pd

    client = Client()

    # Create dataset with span ID links
    spans_df = pd.DataFrame({
        "input": ["What is AI?", "Explain ML"],
        "output": ["Artificial Intelligence is...", "Machine Learning is..."],
        "context.span_id": ["abc123", "def456"],
    })
    dataset = client.datasets.create_dataset(
        name="my-dataset",
        dataframe=spans_df,
        input_keys=["input"],
        output_keys=["output"],
        span_id_key="context.span_id",
    )
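For the csv_file_path route, it can help to check the column names against the CSV header before uploading, and to enforce the documented rule that split_key and split_keys are mutually exclusive. The sketch below uses only the standard library. `csv_upload_kwargs`, the file name, and the column names are assumptions made for illustration, and the real call is left as a comment.

```python
import csv
import os
import tempfile
from typing import Any, Optional, Sequence


def csv_upload_kwargs(
    csv_file_path: str,
    input_keys: Sequence[str],
    output_keys: Sequence[str],
    split_key: Optional[str] = None,
    split_keys: Sequence[str] = (),
) -> dict[str, Any]:
    """Validate column names against the CSV header (hypothetical helper)."""
    if split_key is not None and split_keys:
        raise ValueError("split_key and split_keys are mutually exclusive")

    # Read only the header row; an empty file yields an empty header.
    with open(csv_file_path, newline="", encoding="utf-8") as f:
        header = next(csv.reader(f), [])

    wanted = [*input_keys, *output_keys, *split_keys]
    if split_key is not None:
        wanted.append(split_key)
    missing = [column for column in wanted if column not in header]
    if missing:
        raise ValueError(f"columns not found in CSV header: {missing}")

    kwargs: dict[str, Any] = {
        "csv_file_path": csv_file_path,
        "input_keys": list(input_keys),
        "output_keys": list(output_keys),
    }
    if split_key is not None:
        kwargs["split_key"] = split_key
    elif split_keys:
        kwargs["split_keys"] = list(split_keys)
    return kwargs


# Write a small sample CSV (made-up rows) to a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "qa.csv")
with open(path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["question", "answer", "split"])
    writer.writerow(["What is 2+2?", "4", "train"])
    writer.writerow(["What's the capital of France?", "Paris", "test"])

kwargs = csv_upload_kwargs(path, ["question"], ["answer"], split_key="split")
print(kwargs)

# With a running Phoenix server:
# client.datasets.create_dataset(name="qa-dataset", **kwargs)
```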
- get_dataset(*, dataset, version_id=None, splits=None, timeout=DEFAULT_TIMEOUT_IN_SECONDS)
Retrieve a specific dataset with its examples.
Gets the dataset for a specific version, or the latest version if no version is specified. Returns the complete dataset including metadata and all examples.
- Parameters:
dataset (DatasetIdentifier) – Dataset identifier: a dataset ID string, name string, Dataset object, or dict with ‘id’/‘name’ fields.
version_id (Optional[str]) – Specific version ID of the dataset. If None, returns the latest version.
splits (Optional[list[str]]) – List of dataset split names to filter by. If provided, only returns examples that belong to the specified splits.
timeout (Optional[int]) – Request timeout in seconds (default: 5).
- Returns:
Dataset object containing complete dataset metadata and all examples. The dataset can be iterated over, converted to a DataFrame, or accessed by index.
- Return type:
Dataset
- Raises:
ValueError – If dataset identifier format is invalid or dataset not found.
httpx.HTTPStatusError – If the API request fails.
Example:
    from phoenix.client import Client

    client = Client()

    # Get dataset by name
    dataset = client.datasets.get_dataset(dataset="my-dataset")
    print(f"Dataset {dataset.name} has {len(dataset)} examples")

    # Get specific version
    versioned = client.datasets.get_dataset(
        dataset="my-dataset",
        version_id="version-123",
    )

    # Get dataset filtered by splits
    train_data = client.datasets.get_dataset(
        dataset="my-dataset",
        splits=["train", "validation"],
    )
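Because get_dataset is documented to raise ValueError when the identifier is invalid or the dataset is not found, callers that treat a missing dataset as a normal case may want a small wrapper. The sketch below is one way to write it. `get_dataset_or_none` is a hypothetical helper, and the fake client is a stand-in so the example runs without a Phoenix server.

```python
from types import SimpleNamespace
from typing import Any, Optional


def get_dataset_or_none(client: Any, dataset: str, **kwargs: Any) -> Optional[Any]:
    """Return the dataset, or None when the client raises ValueError.

    HTTP errors are deliberately not caught, so server failures still surface.
    """
    try:
        return client.datasets.get_dataset(dataset=dataset, **kwargs)
    except ValueError:
        return None


# Stand-in for client.datasets.get_dataset, used only to exercise the helper.
def _fake_get_dataset(*, dataset, version_id=None, splits=None):
    known = {"my-dataset": ["example-1", "example-2"]}
    if dataset not in known:
        raise ValueError(f"Dataset not found: {dataset}")
    return known[dataset]


fake_client = SimpleNamespace(datasets=SimpleNamespace(get_dataset=_fake_get_dataset))

found = get_dataset_or_none(fake_client, "my-dataset", splits=["train"])
missing = get_dataset_or_none(fake_client, "no-such-dataset")
print(found, missing)
```

With a real client, the same helper would be called as `get_dataset_or_none(client, "my-dataset")`.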
- get_dataset_versions(*, dataset, limit=100, timeout=DEFAULT_TIMEOUT_IN_SECONDS)
Get dataset versions as a list of dictionaries.
- Parameters:
dataset – A dataset identifier: a dataset ID string, name string, Dataset object, or dict with ‘id’/‘name’ fields.
limit – Maximum number of versions to return, starting from the most recent version
timeout – Optional request timeout in seconds.
- Returns:
List of dictionaries containing version information, including:
version_id: The version ID
created_at: When the version was created (as datetime object)
description: Version description (if any)
metadata: Version metadata (if any)
- Raises:
ValueError – If dataset format is invalid.
httpx.HTTPStatusError – If the API returns an error response.
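The returned version dictionaries can be used to pin work to the version that was current at a given time, by selecting a version_id and passing it to get_dataset. The sketch below shows that selection on made-up sample data. `latest_version_before` is a hypothetical helper, and the sample assumes timezone-aware created_at values, so the cutoff must be timezone-aware as well.

```python
from datetime import datetime, timezone
from typing import Any, Optional, Sequence


def latest_version_before(
    versions: Sequence[dict[str, Any]], cutoff: datetime
) -> Optional[dict[str, Any]]:
    """Return the most recent version created at or before the cutoff."""
    candidates = [v for v in versions if v["created_at"] <= cutoff]
    if not candidates:
        return None
    return max(candidates, key=lambda v: v["created_at"])


# Made-up sample shaped like the documented return value.
versions = [
    {"version_id": "v3", "created_at": datetime(2024, 3, 1, tzinfo=timezone.utc),
     "description": None, "metadata": {}},
    {"version_id": "v2", "created_at": datetime(2024, 2, 1, tzinfo=timezone.utc),
     "description": "added edge cases", "metadata": {}},
    {"version_id": "v1", "created_at": datetime(2024, 1, 1, tzinfo=timezone.utc),
     "description": "initial upload", "metadata": {}},
]

picked = latest_version_before(versions, datetime(2024, 2, 15, tzinfo=timezone.utc))
print(picked["version_id"])  # v2

# With a running Phoenix server:
# dataset = client.datasets.get_dataset(dataset="my-dataset", version_id=picked["version_id"])
```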
- list(*, limit=None, timeout=DEFAULT_TIMEOUT_IN_SECONDS)
List available datasets with automatic pagination handling.
This is the recommended method for most use cases. It automatically handles pagination behind the scenes and returns a simple list of datasets. For large dataset collections, consider using a limit to control memory usage.
- Parameters:
limit – Maximum number of datasets to return. If None, returns all available datasets (use with caution for large collections).
timeout – Request timeout in seconds for each paginated request (default: 5).
- Returns:
List of dataset dictionaries, each containing id, name, description, metadata, created_at (datetime), updated_at (datetime), example_count (int). Limited to the requested number if limit is specified.
- Raises:
httpx.HTTPStatusError – If any API request fails during pagination.
Example:
    from phoenix.client import Client

    client = Client()

    # Get all datasets (automatically paginates, includes counts)
    all_datasets = client.datasets.list()
    print(f"Found {len(all_datasets)} total datasets")

    # Get datasets with example counts
    for dataset in all_datasets:
        print(f"{dataset['name']}: {dataset['example_count']} examples")

    # Get only first 10 datasets (efficient for large collections)
    limited_datasets = client.datasets.list(limit=10)
    print(f"Found {len(limited_datasets)} datasets (limited to 10)")
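Since list() returns plain dictionaries, the result can be summarized with ordinary Python. The sketch below computes a few aggregates over made-up sample data shaped like the documented return value. `summarize` is a hypothetical helper, and it treats a missing or None example_count as zero, which is an assumption.

```python
from typing import Any, Sequence


def summarize(datasets: Sequence[dict[str, Any]]) -> dict[str, Any]:
    """Aggregate example counts across dataset dictionaries."""

    def count(d: dict[str, Any]) -> int:
        # Assumption: a missing or None example_count means zero examples.
        return d.get("example_count") or 0

    largest = max(datasets, key=count, default=None)
    return {
        "dataset_count": len(datasets),
        "total_examples": sum(count(d) for d in datasets),
        "largest": largest["name"] if largest is not None else None,
        "empty": [d["name"] for d in datasets if count(d) == 0],
    }


# Made-up sample; a real list would come from client.datasets.list().
sample = [
    {"id": "1", "name": "qa-dataset", "example_count": 120},
    {"id": "2", "name": "greetings", "example_count": 2},
    {"id": "3", "name": "scratch", "example_count": 0},
]

summary = summarize(sample)
print(summary)
```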
- class AsyncDatasets(client, *, _guard=None)
Bases: object
Provides async methods for interacting with dataset resources.
Examples
Basic usage:
    from phoenix.client import AsyncClient

    client = AsyncClient()

    # Get a dataset
    dataset = await client.datasets.get_dataset(dataset="my-dataset")
    print(f"Dataset {dataset.name} has {len(dataset)} examples")
Creating and updating datasets:
    # Create a new dataset
    dataset = await client.datasets.create_dataset(
        name="qa-dataset",
        inputs=[
            {"question": "What is 2+2?"},
            {"question": "What's the capital of France?"},
        ],
        outputs=[{"answer": "4"}, {"answer": "Paris"}],
    )

    # Add more examples later
    updated = await client.datasets.add_examples_to_dataset(
        dataset="qa-dataset",
        inputs=[{"question": "Who wrote Hamlet?"}],
        outputs=[{"answer": "Shakespeare"}],
    )
Working with DataFrames:
    import pandas as pd

    # Convert dataset to DataFrame (sync operation)
    df = dataset.to_dataframe()
    print(df.columns)
    # Index(['input', 'output', 'metadata'], dtype='object')

    # Create dataset from DataFrame
    df = pd.DataFrame({
        "prompt": ["Hello", "Hi there"],
        "response": ["Hi!", "Hello!"],
        "score": [0.9, 0.95],
    })
    dataset = await client.datasets.create_dataset(
        name="greetings",
        dataframe=df,
        input_keys=["prompt"],
        output_keys=["response"],
        metadata_keys=["score"],
    )
- async add_examples_to_dataset(*, dataset, examples=None, dataframe=None, csv_file_path=None, input_keys=(), output_keys=(), metadata_keys=(), split_keys=(), split_key=None, span_id_key=None, example_id_key=None, inputs=(), outputs=(), metadata=(), timeout=DEFAULT_TIMEOUT_IN_SECONDS)
Append examples to an existing dataset on the Phoenix server.
- Parameters:
dataset – A dataset identifier: a dataset ID string, name string, Dataset object, or dict with ‘id’/‘name’ fields.
examples – Either a single dictionary with required ‘input’ and ‘output’ keys and an optional ‘metadata’ key, or an iterable of such dictionaries. When provided, inputs/outputs/metadata are extracted automatically.
dataframe – pandas DataFrame (requires pandas to be installed).
csv_file_path – Location of a CSV text file
input_keys – List of column names used as input keys.
output_keys – List of column names used as output keys.
metadata_keys – List of column names used as metadata keys.
split_keys – Deprecated. Use split_key instead. List of column names used for automatically assigning examples to splits.
split_key – Optional single column name used for assigning examples to splits. Mutually exclusive with split_keys.
inputs – List of dictionaries each corresponding to an example.
outputs – List of dictionaries each corresponding to an example.
metadata – List of dictionaries each corresponding to an example.
timeout – Optional request timeout in seconds.
- Returns:
A Dataset object containing the updated dataset and examples.
- Raises:
ValueError – If invalid parameter combinations are provided.
ImportError – If pandas is required but not installed.
httpx.HTTPStatusError – If the API returns an error response.
- async create_dataset(*, name, examples=None, dataframe=None, csv_file_path=None, input_keys=(), output_keys=(), metadata_keys=(), split_keys=(), split_key=None, span_id_key=None, example_id_key=None, inputs=(), outputs=(), metadata=(), dataset_description=None, timeout=DEFAULT_TIMEOUT_IN_SECONDS)
Create a new dataset by uploading examples to the Phoenix server.
- Parameters:
name – Name of the dataset.
examples – Either a single dictionary with required ‘input’ and ‘output’ keys and an optional ‘metadata’ key, or an iterable of such dictionaries. When provided, inputs/outputs/metadata are extracted automatically.
dataframe – pandas DataFrame (requires pandas to be installed).
csv_file_path – Location of a CSV text file
input_keys – List of column names used as input keys.
output_keys – List of column names used as output keys.
metadata_keys – List of column names used as metadata keys.
split_keys – Deprecated. Use split_key instead. List of column names used for automatically assigning examples to splits.
split_key – Optional single column name used for assigning examples to splits. Mutually exclusive with split_keys.
inputs – List of dictionaries each corresponding to an example.
outputs – List of dictionaries each corresponding to an example.
metadata – List of dictionaries each corresponding to an example.
dataset_description – Description of the dataset.
timeout – Optional request timeout in seconds.
- Returns:
A Dataset object containing the uploaded dataset and examples.
- Raises:
ValueError – If invalid parameter combinations are provided.
ImportError – If pandas is required but not installed.
httpx.HTTPStatusError – If the API returns an error response.
- async get_dataset(*, dataset, version_id=None, splits=None, timeout=DEFAULT_TIMEOUT_IN_SECONDS)
Gets the dataset for a specific version, or gets the latest version of the dataset if no version is specified.
- Parameters:
dataset – A dataset identifier: a dataset ID string, name string, Dataset object, or dict with ‘id’/‘name’ fields.
version_id – An ID for the version of the dataset, or None.
splits – Optional list of dataset split names to filter by. If provided, only returns examples that belong to the specified splits.
timeout – Optional request timeout in seconds.
- Returns:
A Dataset object containing the dataset metadata and examples.
- Raises:
ValueError – If dataset format is invalid.
httpx.HTTPStatusError – If the API returns an error response.
- async get_dataset_versions(*, dataset, limit=100, timeout=DEFAULT_TIMEOUT_IN_SECONDS)
Get dataset versions as a list of dictionaries.
- Parameters:
dataset – A dataset identifier: a dataset ID string, name string, Dataset object, or dict with ‘id’/‘name’ fields.
limit – Maximum number of versions to return, starting from the most recent version
timeout – Optional request timeout in seconds.
- Returns:
List of dictionaries containing version information, including:
version_id: The version ID
created_at: When the version was created (as datetime object)
description: Version description (if any)
metadata: Version metadata (if any)
- Raises:
ValueError – If dataset format is invalid.
httpx.HTTPStatusError – If the API returns an error response.
- async list(*, limit=None, timeout=DEFAULT_TIMEOUT_IN_SECONDS)
List available datasets with automatic pagination handling.
This is the recommended method for most use cases. It automatically handles pagination behind the scenes and returns a simple list of datasets. For large dataset collections, consider using a limit to control memory usage.
- Parameters:
limit – Maximum number of datasets to return. If None, returns all available datasets (use with caution for large collections).
timeout – Request timeout in seconds for each paginated request (default: 5).
- Returns:
List of dataset dictionaries, each containing id, name, description, metadata, created_at (datetime), updated_at (datetime), example_count (int). Limited to the requested number if limit is specified.
- Raises:
httpx.HTTPStatusError – If any API request fails during pagination.
Example:
    from phoenix.client import AsyncClient

    client = AsyncClient()

    # Get all datasets (automatically paginates)
    all_datasets = await client.datasets.list()
    print(f"Found {len(all_datasets)} total datasets")

    # Get only first 10 datasets (efficient for large collections)
    limited_datasets = await client.datasets.list(limit=10)
    print(f"Found {len(limited_datasets)} datasets (limited to 10)")
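One reason to use the async client is to issue several requests concurrently. The sketch below fetches multiple datasets with `asyncio.gather`. `fetch_many` is a hypothetical helper, and the fake client is a stand-in so the example runs without a Phoenix server. With a real AsyncClient, the same helper would be awaited from inside a coroutine.

```python
import asyncio
from types import SimpleNamespace
from typing import Any, Sequence


async def fetch_many(client: Any, names: Sequence[str]) -> dict[str, Any]:
    """Fetch several datasets concurrently and map each name to its result."""
    results = await asyncio.gather(
        *(client.datasets.get_dataset(dataset=name) for name in names)
    )
    # gather preserves argument order, so results line up with names.
    return dict(zip(names, results))


# Stand-in for AsyncClient().datasets.get_dataset, used only for this demo.
async def _fake_get_dataset(*, dataset, version_id=None, splits=None):
    await asyncio.sleep(0)  # yield control, as a real network call would
    return {"name": dataset}


fake_client = SimpleNamespace(datasets=SimpleNamespace(get_dataset=_fake_get_dataset))

fetched = asyncio.run(fetch_many(fake_client, ["qa-dataset", "greetings"]))
print(fetched)
```

If any single request raises, `asyncio.gather` propagates the first exception by default. Passing `return_exceptions=True` would instead return exceptions in place of results.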