Evals#

LLM Interfaces#

LLM#

class LLM(*, provider=None, model=None, client=None, initial_per_second_request_rate=None, sync_client_kwargs=None, async_client_kwargs=None, **kwargs)#

Bases: object

An LLM wrapper that simplifies the API for generating text and objects.

This wrapper delegates API access to SDK/client libraries that are installed in the active Python environment. To show supported providers, use show_provider_availability().

The LLM class provides both synchronous and asynchronous methods for all operations.

Examples:

from phoenix.evals.llm import LLM, show_provider_availability
show_provider_availability()
llm = LLM(provider="openai", model="gpt-4o")
llm.generate_text(prompt="Hello, world!")
# "Hello, world!"
llm.generate_object(
    prompt="Hello, world!",
    schema={
        "type": "object",
        "properties": {
            "text": {"type": "string"}
        },
        "required": ["text"]
    })
# {"text": "Hello, world!"}

async async_generate_classification(prompt, labels, include_explanation=True, description=None, **kwargs)#

Asynchronously generate a classification given a prompt and a set of labels.

Parameters:
  • prompt (Union[str, List[Dict[str, Any]]]) – The prompt template to go with the tool call.

  • labels (Union[List[str], Dict[str, str]]) – Either:

      - A list of strings, where each string is a label
      - A dictionary where keys are labels and values are descriptions

  • include_explanation (bool) – Whether to prompt the LLM for an explanation.

  • description (Optional[str]) – A description of the classification task.

  • **kwargs – Additional keyword arguments to pass to the LLM SDK.

Returns:

The generated classification.

Return type:

Dict[str, Any]

async async_generate_object(prompt, schema, tracer=None, **kwargs)#

Asynchronously generate an object given a prompt and a schema.

Parameters:
  • prompt (Union[str, List[Dict[str, Any]]]) – The prompt to generate the object from.

  • schema (Dict[str, Any]) – A JSON schema that describes the generated object.

  • tracer (Optional[Tracer]) – Optional tracer for tracing operations.

  • **kwargs – Additional keyword arguments to pass to the LLM SDK.

Returns:

The generated object.

Return type:

Dict[str, Any]

async async_generate_text(prompt, tracer=None, **kwargs)#

Asynchronously generate text given a prompt.

Parameters:
  • prompt (Union[str, List[Dict[str, Any]]]) – The prompt to generate text from.

  • tracer (Optional[Tracer]) – The tracer to use for tracing.

  • **kwargs – Additional keyword arguments to pass to the LLM SDK.

Returns:

The generated text.

Return type:

str

generate_classification(prompt, labels, include_explanation=True, description=None, **kwargs)#

Generate a classification given a prompt and a set of labels.

Parameters:
  • prompt (Union[str, List[Dict[str, Any]]]) – The prompt template to go with the tool call.

  • labels (Union[List[str], Dict[str, str]]) – Either:

      - A list of strings, where each string is a label
      - A dictionary where keys are labels and values are descriptions

  • include_explanation (bool) – Whether to prompt the LLM for an explanation.

  • description (Optional[str]) – A description of the classification task.

  • **kwargs – Additional keyword arguments to pass to the LLM SDK.

Returns:

The generated classification.

Return type:

Dict[str, Any]

Examples:

from phoenix.evals import LLM
llm = LLM(provider="openai", model="gpt-4o", client="openai")
llm.generate_classification(
    prompt="Hello, world!",
    labels=["yes", "no"],
)
# {"label": "yes", "explanation": "The answer is yes."}
llm.generate_classification(
    prompt="Hello, world!",
    labels={"yes": "Positive response", "no": "Negative response"},
    include_explanation=False,
)
# {"label": "yes"}

generate_object(prompt, schema, tracer=None, **kwargs)#

Generate an object given a prompt and a schema.

Parameters:
  • prompt (Union[str, List[Dict[str, Any]]]) – The prompt to generate the object from.

  • schema (Dict[str, Any]) – A JSON schema that describes the generated object.

  • tracer (Optional[Tracer]) – Optional tracer for tracing operations.

  • **kwargs – Additional keyword arguments to pass to the LLM SDK.

Returns:

The generated object.

Return type:

Dict[str, Any]

generate_text(prompt, tracer=None, **kwargs)#

Generate text given a prompt.

Parameters:
  • prompt (Union[str, List[Dict[str, Any]]]) – The prompt to generate text from.

  • tracer (Optional[Tracer]) – Optional tracer for tracing operations.

  • **kwargs – Additional keyword arguments to pass to the LLM SDK.

Returns:

The generated text.

Return type:

str

Prompt Template#

class Template(*, template, template_format=None)#

Bases: object

Template for rendering prompts with mustache ({{variable}}) or f-string ({variable}) formats.

Supports auto-detection of template format and handles JSON content correctly.

Deprecated: Template is deprecated. Use PromptTemplate instead, which supports both string templates and message lists (OpenAI-style format).

render(variables, tracer=None)#

Render the template with the given variables.

Parameters:
  • variables (Dict[str, Any]) – The variables to substitute into the template.

  • tracer (Optional[Tracer]) – Optional tracer for tracing operations.

Returns:

The rendered template.

Return type:

str

Raises:

TypeError – If variables is not a dictionary.

property variables#

Get the list of variables used in the template.

Returns:

A list of variable names found in the template.

Return type:

List[str]
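
As a rough illustration of how variable names might be pulled out of the two supported formats, the sketch below uses only the standard library. The helper names fstring_variables and mustache_variables are hypothetical and are not part of phoenix.evals; the library's own format detection and JSON handling may work differently.

```python
import re
from string import Formatter
from typing import List


def fstring_variables(template: str) -> List[str]:
    """Return field names used in an f-string style template ({variable})."""
    names: List[str] = []
    for _, field_name, _, _ in Formatter().parse(template):
        # field_name is None for literal text and "" for bare "{}" fields
        if field_name and field_name not in names:
            names.append(field_name)
    return names


def mustache_variables(template: str) -> List[str]:
    """Return variable names used in a mustache style template ({{variable}})."""
    names: List[str] = []
    pattern = r"\{\{\s*([A-Za-z_][A-Za-z0-9_.]*)\s*\}\}"
    for match in re.finditer(pattern, template):
        if match.group(1) not in names:
            names.append(match.group(1))
    return names
```

Both helpers return each name once, in order of first appearance, which matches the List[str] return type described for the variables property.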

Evaluator Abstractions#

Evaluator Base#

class Evaluator(*, name, kind, direction='maximize', input_schema=None)#

Bases: ABC

Core abstraction for evaluators.

Supports single-record synchronous (evaluate) and asynchronous (async_evaluate) modes with optional per-call input_mapping.

Note: Subclasses must implement either the _evaluate or _async_evaluate method. Implementing both methods is recommended.

Parameters:
  • name – The name of this evaluator, used for identification and Score naming.

  • kind – The kind of this evaluator (human, llm, or code).

  • input_schema – Optional Pydantic BaseModel for input typing and validation. If None, subclasses infer fields from prompts or function signatures and may construct a model dynamically.

  • direction – The direction for score optimization (“maximize” or “minimize”). Defaults to “maximize”.

async async_evaluate(eval_input, input_mapping=None)#

Async variant of evaluate. Validates and remaps input as described in evaluate.

Returns:

A list of Score objects.

bind(input_mapping)#

Binds an evaluator with a fixed input mapping.

describe()#

Return a JSON-serializable description of the evaluator, including its name, kind, direction, and input fields derived from the Pydantic input schema when available.

property direction#

The direction for score optimization.

evaluate(eval_input, input_mapping=None)#

Validate and remap eval_input using the evaluator’s input fields (from input_schema when available, otherwise from the provided input_mapping). An optional per-call input_mapping maps evaluator-required field names to keys/paths in eval_input.

Returns:

A list of Score objects.

property input_schema#

Read-only Pydantic input schema for this evaluator, if set.

property kind#

The kind of this evaluator.

property name#

The name of this evaluator.

property source#

The source of this evaluator (deprecated).

unbind()#

Unbinds an evaluator from an input mapping.

LLMEvaluator#

class LLMEvaluator(*, name, llm, prompt_template, schema=None, input_schema=None, direction='maximize', **kwargs)#

Bases: Evaluator

Base LLM evaluator that infers required input fields from its prompt template and constructs a default Pydantic input schema when none is supplied.

Note: Subclasses must implement either the _evaluate or _async_evaluate method. Implementing both methods is recommended.

Parameters:
  • name – Identifier for this evaluator and the name used in produced Scores.

  • llm – The LLM instance to use for evaluation.

  • prompt_template – The prompt template with placeholders for required fields; used to infer required variables. Can be either a string template or a list of message dictionaries (for chat-based models).

  • schema – Optional tool/JSON schema for structured output when supported by the LLM.

  • input_schema – Optional Pydantic model describing/validating inputs. If not provided, a model is dynamically created from the prompt variables (all str, required).

  • direction – The score optimization direction (“maximize” or “minimize”). Defaults to “maximize”.

  • **kwargs – Invocation parameters forwarded to the LLM client

async async_evaluate(eval_input, input_mapping=None)#

Async variant of evaluate. Validates and remaps input as described in evaluate.

Returns:

A list of Score objects.

evaluate(eval_input, input_mapping=None)#

Validate and remap eval_input using the evaluator’s input fields (from input_schema when available, otherwise from the provided input_mapping). An optional per-call input_mapping maps evaluator-required field names to keys/paths in eval_input.

Returns:

A list of Score objects.

property prompt_template#

Get the prompt template.

ClassificationEvaluator#

class ClassificationEvaluator(*, name, llm, prompt_template, choices, include_explanation=True, input_schema=None, direction='maximize', **kwargs)#

Bases: LLMEvaluator

LLM-based evaluator for classification-style judgements.

Supports label-only or label+score mappings, and returns explanations by default. Note: Requires the LLM to have tool calling or structured output capabilities.

Parameters:
  • name – Identifier for this evaluator and the name used in produced Scores.

  • llm – The LLM instance to use for evaluation. Must support tool calling or structured output for reliable classification.

  • prompt_template – The prompt template with placeholders for required input fields. Can be either a string template or a list of message dictionaries (for chat-based models). Template variables are inferred automatically.

  • choices – Classification choices in one of three formats:

      1. List[str]: Simple list of label names (e.g., ["positive", "negative"]). Scores will be None.

      2. Dict[str, Union[float, int]]: Labels mapped to numeric scores (e.g., {"positive": 1.0, "negative": 0.0}).

      3. Dict[str, Tuple[Union[float, int], str]]: Labels mapped to tuples of (score, description) (e.g., {"positive": (1.0, "Positive sentiment"), "negative": (0.0, "Negative sentiment")}). Not recommended as LLMs do not reliably follow this schema.

  • include_explanation – Whether to request explanations for classification decisions. Defaults to True in accordance with best practices.

  • input_schema – Optional Pydantic model for input validation. If not provided, a model is automatically created from prompt template variables.

  • direction – Score optimization direction (“maximize” or “minimize”). Defaults to “maximize”.

  • **kwargs – Invocation parameters forwarded to the LLM client

Returns:

A list containing a single Score object with the classification result, including label, optional score, and optional explanation.

Return type:

List[Score]
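
The three accepted shapes of choices can be thought of as one table from label to (score, description). The sketch below shows one way such a normalization could look; normalize_choices is a hypothetical helper written for illustration and is not the library's internal code.

```python
from typing import Dict, List, Optional, Tuple, Union

Number = Union[int, float]
Choices = Union[List[str], Dict[str, Number], Dict[str, Tuple[Number, str]]]


def normalize_choices(
    choices: Choices,
) -> Dict[str, Tuple[Optional[Number], Optional[str]]]:
    """Map every label to a (score, description) pair; missing parts become None."""
    if isinstance(choices, list):
        # Labels only: no score and no description
        return {label: (None, None) for label in choices}
    normalized: Dict[str, Tuple[Optional[Number], Optional[str]]] = {}
    for label, value in choices.items():
        if isinstance(value, tuple):
            score, description = value
            normalized[label] = (score, description)
        else:
            # Plain numeric score without a description
            normalized[label] = (value, None)
    return normalized
```

Under this view, a label-only list is the case where every score is None, which is consistent with the note that scores will be None for List[str] choices.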

Examples

Classification with labels only:

from phoenix.evals import ClassificationEvaluator, LLM

evaluator = ClassificationEvaluator(
    name="sentiment",
    llm=LLM(provider="openai", model="gpt-4"),
    prompt_template="Classify the sentiment of this text: {text}",
    choices=["positive", "negative", "neutral"]
)

result = evaluator.evaluate({"text": "I love this product!"})
print(result[0].label)  # "positive"
print(result[0].explanation)  # LLM's reasoning
print(result[0].score)  # None

Classification with scores:

# Map labels to numeric scores
evaluator = ClassificationEvaluator(
    name="quality",
    llm=llm,
    prompt_template="Rate the quality of this response: {response}",
    choices={
        "excellent": 5,
        "good": 4,
        "fair": 3,
        "poor": 2,
        "terrible": 1
    }
)

result = evaluator.evaluate({"response": "Great explanation with examples"})
print(result[0].label)  # "excellent"
print(result[0].score)  # 5

Classification with scores and descriptions (use with caution):

# Map labels to (score, description) tuples
evaluator = ClassificationEvaluator(
    name="relevance",
    llm=llm,
    prompt_template="How relevant is this answer to the question?\n"
                   "Question: {question}\nAnswer: {answer}",
    choices={
        "highly_relevant": (1.0, "Answer directly addresses the question"),
        "somewhat_relevant": (0.5, "Answer partially addresses the question"),
        "not_relevant": (0.0, "Answer does not address the question")
    }
)

result = evaluator.evaluate({
    "question": "What is the capital of France?",
    "answer": "Paris is the capital city of France."
})
print(result[0].label)  # "highly_relevant"
print(result[0].score)  # 1.0

Core Functions#

create_evaluator#

create_evaluator(name, source=None, direction='maximize', kind=None)#

Decorator that turns a simple function into an Evaluator instance.

The decorated function should accept keyword args matching its required fields and return a value that can be converted to a Score. The returned object is an Evaluator with full support for evaluate/async_evaluate and maintains direct callability.

Parameters:
  • name – Identifier for the evaluator and the name used in produced Scores.

  • kind – The kind of this evaluator (“human”, “llm”, or “code”). Defaults to “code”.

  • direction – The score optimization direction (“maximize” or “minimize”). Defaults to “maximize”.

Examples

Basic usage with numeric return:

from phoenix.evals import create_evaluator

@create_evaluator(name="precision")
def precision(retrieved_documents: list[int], relevant_documents: list[int]) -> float:
    # Calculate precision for information retrieval
    relevant_set = set(relevant_documents)
    hits = sum(1 for doc in retrieved_documents if doc in relevant_set)
    return hits / len(retrieved_documents) if retrieved_documents else 0.0

# Use the evaluator
result = precision.evaluate({
    "retrieved_documents": [1, 2, 3, 4],
    "relevant_documents": [2, 4, 6]
})
print(result[0].score)  # 0.5

# Direct callability maintained:
result = precision(retrieved_documents=[1, 2, 3, 4], relevant_documents=[2, 4, 6])
print(result)  # 0.5

Different return types:

# Boolean return (converted to score and label)
@create_evaluator(name="is_valid")
def is_valid(text: str) -> bool:
    return len(text.strip()) > 0

# Dictionary return with multiple fields
@create_evaluator(name="positive_sentiment")
def positive_sentiment(text: str) -> dict:
    # Simplified sentiment analysis
    positive_words = ["good", "great", "excellent"]
    score = sum(1 for word in positive_words if word in text.lower())
    return {
        "score": score / len(positive_words),
        "label": "positive" if score > 0 else "neutral",
        "explanation": f"Found {score} positive indicators"
    }

# Tuple return (score, label, explanation)
@create_evaluator(name="length_check")
def length_check(text: str) -> tuple:
    length = len(text)
    is_good = 10 <= length <= 100
    return (float(is_good), "good" if is_good else "bad", f"Length: {length}")

Using with dataframes:

import pandas as pd
from phoenix.evals import evaluate_dataframe

@create_evaluator(name="word_count")
def word_count(text: str) -> int:
    return len(text.split())

df = pd.DataFrame({
    "text": ["Hello world", "This is a longer sentence", "Short"]
})

results_df = evaluate_dataframe(dataframe=df, evaluators=[word_count])
print(results_df["word_count_score"])  # JSON scores for each row

Notes

The decorated function can return:

  • A Score object (no conversion needed)

  • A number (converted to Score.score)

  • A boolean (converted to integer Score.score and string Score.label)

  • A short string (≤3 words, converted to Score.label)

  • A long string (≥4 words, converted to Score.explanation)

  • A dictionary with keys “score”, “label”, or “explanation”

  • A tuple of values (only bool, number, str types allowed)

An input_schema is automatically created from the function signature, capturing the required input fields, their types, and any defaults. For best results, do not use *args or **kwargs.

The decorator automatically handles conversion to a valid Score object.
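
To make the conversion rules listed above concrete, here is a minimal sketch that covers the number, boolean, string, and dictionary cases (tuples and Score objects are left out). The function to_score_fields is hypothetical, and the label text produced for booleans is an assumption; the decorator's real conversion code may differ.

```python
from typing import Any, Dict


def to_score_fields(value: Any) -> Dict[str, Any]:
    """Convert a raw return value into score/label/explanation fields."""
    # bool must be checked before int/float because bool is a subclass of int
    if isinstance(value, bool):
        # Assumption: the label is the string form of the boolean
        return {"score": int(value), "label": str(value)}
    if isinstance(value, (int, float)):
        return {"score": value}
    if isinstance(value, str):
        # Short strings (<= 3 words) become labels, longer ones explanations
        key = "label" if len(value.split()) <= 3 else "explanation"
        return {key: value}
    if isinstance(value, dict):
        allowed = ("score", "label", "explanation")
        return {k: v for k, v in value.items() if k in allowed}
    raise TypeError(f"Cannot convert {type(value).__name__} to score fields")
```

The ordering of the isinstance checks matters: testing for numbers first would silently treat True as the number 1 and lose the label.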

create_classifier#

create_classifier(name, prompt_template, llm, choices, direction='maximize')#

Factory to create a ClassificationEvaluator.

Note: The evaluator requires the LLM to have tool calling or structured output capabilities.

Parameters:
  • name – Identifier for this evaluator and the name used in produced Scores.

  • llm – The LLM instance to use for evaluation.

  • prompt_template – Prompt template string with placeholders for inputs.

  • choices – One of List[str], Dict[str, number], or Dict[str, Tuple[number, str]] describing classification labels (and optional scores/descriptions).

  • direction – The score optimization direction (“maximize” or “minimize”). Defaults to “maximize”.

Returns:

A ClassificationEvaluator instance.

Examples

Creating a simple sentiment classifier:

from phoenix.evals import create_classifier, LLM

llm = LLM(provider="openai", model="gpt-4")

sentiment_evaluator = create_classifier(
    name="sentiment",
    prompt_template="Analyze the sentiment: {text}",
    llm=llm,
    choices=["positive", "negative", "neutral"]
)

result = sentiment_evaluator.evaluate({"text": "Great product!"})
print(result[0].label)  # "positive"
print(result[0].score)  # None

Creating a classifier with numeric scores:

quality_evaluator = create_classifier(
    name="response_quality",
    prompt_template="Rate this response quality: {response}",
    llm=llm,
    choices={
        "excellent": 5,
        "good": 4,
        "average": 3,
        "poor": 2,
        "terrible": 1
    }
)

result = quality_evaluator.evaluate({"response": "Detailed and helpful answer"})
print(f"Quality: {result[0].label} (Score: {result[0].score})")

Creating a classifier with scores and descriptions:

accuracy_evaluator = create_classifier(
    name="factual_accuracy",
    prompt_template="Check factual accuracy: {claim}",
    llm=llm,
    choices={
        "accurate": (1.0, "Factually correct information"),
        "partially_accurate": (0.5, "Some correct, some incorrect information"),
        "inaccurate": (0.0, "Factually incorrect information")
    }
)

result = accuracy_evaluator.evaluate({"claim": "Paris is the capital of France"})
print(f"Accuracy: {result[0].label} (Score: {result[0].score})")

bind_evaluator#

bind_evaluator(evaluator, input_mapping)#

Helper to bind an evaluator with a fixed input mapping.

This function allows you to create a version of an evaluator that automatically maps input data fields to the evaluator’s expected field names. This is useful when your data schema doesn’t match the evaluator’s expected inputs.

Parameters:
  • evaluator – The evaluator instance to bind.

  • input_mapping – A dictionary mapping evaluator field names to either:

      - String keys for direct field mapping
      - Callable functions for computed field mapping

Returns:

The same evaluator instance with the input mapping bound.

Examples

Basic field mapping:

from phoenix.evals import create_evaluator, bind_evaluator

@create_evaluator(name="text_length")
def text_length(content: str) -> int:
    return len(content)

# Map 'message' field to 'content' parameter
mapping = {"content": "message"}
bound_evaluator = bind_evaluator(evaluator=text_length, input_mapping=mapping)

# Now we can use 'message' instead of 'content'
result = bound_evaluator.evaluate({"message": "Hello world"})
print(result[0].score)  # 11

Using lambda functions for computed mappings:

@create_evaluator(name="precision")
def precision(retrieved_docs: list, relevant_docs: list) -> float:
    relevant_set = set(relevant_docs)
    hits = sum(1 for doc in retrieved_docs if doc in relevant_set)
    return hits / len(retrieved_docs) if retrieved_docs else 0.0

# Convert single document to list format
mapping = {
    "retrieved_docs": "retrieved_documents",
    "relevant_docs": lambda x: [x["expected_document"]]
}
bound_evaluator = bind_evaluator(evaluator=precision, input_mapping=mapping)

data = {
    "retrieved_documents": [1, 2, 3],
    "expected_document": 2
}
result = bound_evaluator.evaluate(data)

Complex data transformation:

@create_evaluator(name="response_quality")
def response_quality(question: str, answer: str, context: str) -> dict:
    # Simplified quality check
    has_context = context.lower() in answer.lower()
    return {
        "score": 1.0 if has_context else 0.0,
        "label": "good" if has_context else "poor",
        "explanation": ("Answer uses context" if has_context
                       else "Answer ignores context")
    }

# Map nested data structure
mapping = {
    "question": "query",
    "answer": "response.text",
    "context": lambda x: " ".join(x["documents"])
}
bound_evaluator = bind_evaluator(evaluator=response_quality, input_mapping=mapping)

data = {
    "query": "What is the capital?",
    "response": {"text": "Paris is the capital of France"},
    "documents": ["France info", "Paris is the capital"]
}
result = bound_evaluator.evaluate(data)
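
The mappings in these examples combine three kinds of values: plain keys, dotted paths such as "response.text", and callables. The sketch below illustrates how such a mapping could be resolved against an input record. The helpers resolve_path and remap_input are hypothetical and only approximate the behavior described here; they are not the library's implementation.

```python
from typing import Any, Callable, Dict, Mapping, Union

MappingValue = Union[str, Callable[[Mapping[str, Any]], Any]]


def resolve_path(data: Mapping[str, Any], path: str) -> Any:
    """Follow a dotted path such as "response.text" through nested dictionaries."""
    current: Any = data
    for key in path.split("."):
        current = current[key]
    return current


def remap_input(
    eval_input: Mapping[str, Any], input_mapping: Mapping[str, MappingValue]
) -> Dict[str, Any]:
    """Build the evaluator's keyword arguments from eval_input."""
    remapped: Dict[str, Any] = {}
    for field, source in input_mapping.items():
        if callable(source):
            # Callables receive the whole input record and compute the value
            remapped[field] = source(eval_input)
        else:
            remapped[field] = resolve_path(eval_input, source)
    return remapped
```

In this sketch a missing key raises KeyError; how the library reports missing or invalid fields is not specified here.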

evaluate_dataframe#

evaluate_dataframe(dataframe, evaluators, tqdm_bar_format=None, hide_tqdm_bar=False, exit_on_error=None, max_retries=None)#

Evaluate a dataframe with a list of evaluators and return an augmented dataframe.

This function uses a synchronous executor; for async evaluation, use async_evaluate_dataframe.

Parameters:
  • dataframe – The input dataframe to evaluate. Each row will be converted to a dict and passed to each evaluator.

  • evaluators – List of evaluators to apply to each row. Input mapping should be already bound via bind_evaluator or column names should match evaluator input fields.

  • tqdm_bar_format – Optional format string for the progress bar. If None and hide_tqdm_bar is False, the default progress bar formatter is used.

  • hide_tqdm_bar – Optional flag to control whether to hide the progress bar. If None, the progress bar is shown. Defaults to False.

  • exit_on_error – Optional flag to control whether execution should stop on the first error. If None, uses SyncExecutor’s default (True).

  • max_retries – Optional number of times to retry on exceptions. If None, uses SyncExecutor’s default (10).

Returns:

A copy of the input dataframe with additional columns for scores and exceptions. For each evaluator, columns are added for:

  • “{evaluator.name}_execution_details”: Details about any exceptions encountered, execution time, and status.

  • “{score.name}_score”: JSON-serialized Score objects for each score returned.

Examples

Basic dataframe evaluation:

import pandas as pd
from phoenix.evals import create_evaluator, evaluate_dataframe

@create_evaluator(name="word_count")
def word_count(text: str) -> int:
    return len(text.split())

@create_evaluator(name="has_question")
def has_question(text: str) -> bool:
    return "?" in text

df = pd.DataFrame({
    "text": [
        "Hello world",
        "How are you today?",
        "This is a longer sentence with multiple words"
    ]
})

evaluators = [word_count, has_question]
results_df = evaluate_dataframe(dataframe=df, evaluators=evaluators, hide_tqdm_bar=True)

# Results include original columns plus score columns
print(results_df.columns)
# ['text', 'word_count_execution_details', 'has_question_execution_details',
#  'word_count_score', 'has_question_score']

Using with input mapping:

from phoenix.evals import bind_evaluator

@create_evaluator(name="response_length")
def response_length(response: str) -> int:
    return len(response)

# Data has 'answer' column but evaluator expects 'response'
mapping = {"response": "answer"}
bound_evaluator = bind_evaluator(evaluator=response_length, input_mapping=mapping)

df = pd.DataFrame({
    "question": ["What is AI?", "How does ML work?"],
    "answer": ["AI is artificial intelligence",
              "ML uses algorithms to learn patterns"]
})

results_df = evaluate_dataframe(dataframe=df, evaluators=[bound_evaluator])

With progress bar and error handling:

results_df = evaluate_dataframe(
    dataframe=df,
    evaluators=evaluators,
    tqdm_bar_format="{l_bar}{bar}| {n_fmt}/{total_fmt} [{elapsed}<{remaining}]",
    exit_on_error=False,  # Continue on errors
    max_retries=3
)

# Check for evaluation errors
import json
for idx, row in results_df.iterrows():
    details = json.loads(row['word_count_execution_details'])
    if details['status'] != 'success':
        print(f"Row {idx} failed: {details['exceptions']}")

Notes

  • Score name collisions: If multiple evaluators return scores with the same name, they will write to the same column (e.g., ‘same_name_score’). This can lead to data loss as later scores overwrite earlier ones.

  • Similarly, evaluator names should be unique to ensure execution_details columns don’t collide.

  • Failed evaluations: If an evaluation fails, the failure details will be recorded in the execution_details column and the score will be None.
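
Because colliding names silently overwrite columns, it can help to check for duplicates before running an evaluation. The following is a small sketch of such a check; find_duplicate_names is a hypothetical helper, not a phoenix.evals function.

```python
from collections import Counter
from typing import Iterable, List


def find_duplicate_names(names: Iterable[str]) -> List[str]:
    """Return names that occur more than once, in first-seen order."""
    counts = Counter(names)
    return [name for name, count in counts.items() if count > 1]
```

With real evaluators, the names could be collected from the name property documented for Evaluator, for example find_duplicate_names(e.name for e in evaluators).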

async_evaluate_dataframe#

async_evaluate_dataframe(dataframe, evaluators, concurrency=None, tqdm_bar_format=None, hide_tqdm_bar=False, exit_on_error=None, max_retries=None)#

Evaluate a dataframe with a list of evaluators and return an augmented dataframe.

This function uses an asynchronous executor; for sync evaluation, use evaluate_dataframe.

Parameters:
  • dataframe – The input dataframe to evaluate. Each row will be converted to a dict and passed to each evaluator.

  • evaluators – List of evaluators to apply to each row. Input mapping should be already bound via bind_evaluator or column names should match evaluator input fields.

  • concurrency – Optional number of concurrent consumers. If None, uses AsyncExecutor’s default (3).

  • tqdm_bar_format – Optional format string for the progress bar. If None, use the default formatter.

  • hide_tqdm_bar – Optional flag to control whether to hide the progress bar. If None, the progress bar is shown. Defaults to False.

  • exit_on_error – Optional flag to control whether execution should stop on the first error. If None, uses AsyncExecutor’s default (True).

  • max_retries – Optional number of times to retry on exceptions. If None, uses AsyncExecutor’s default (10).

Returns:

A copy of the input dataframe with additional columns for scores and exceptions. For each evaluator, columns are added for:

  • “{evaluator.name}_execution_details”: Details about any exceptions encountered, execution time, and status.

  • “{score.name}_score”: JSON-serialized Score objects for each score returned.
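
The concurrency parameter is described as a number of concurrent consumers. As a generic illustration of that pattern (not AsyncExecutor's actual implementation), the sketch below runs a fixed number of asyncio consumer tasks that pull rows from a shared queue. The names run_with_concurrency and evaluate_row are hypothetical.

```python
import asyncio
from typing import Any, Awaitable, Callable, List


async def run_with_concurrency(
    rows: List[Any],
    evaluate_row: Callable[[Any], Awaitable[Any]],
    concurrency: int = 3,
) -> List[Any]:
    """Evaluate every row with at most `concurrency` consumers running at once."""
    queue: asyncio.Queue = asyncio.Queue()
    for index, row in enumerate(rows):
        queue.put_nowait((index, row))
    results: List[Any] = [None] * len(rows)

    async def consumer() -> None:
        # Each consumer keeps taking rows until the queue is empty
        while True:
            try:
                index, row = queue.get_nowait()
            except asyncio.QueueEmpty:
                return
            results[index] = await evaluate_row(row)

    await asyncio.gather(*(consumer() for _ in range(concurrency)))
    return results
```

Results are written back by row index, so the output order matches the input order even when rows finish out of order. Retries and error handling are omitted from this sketch.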

Examples

Basic async evaluation:

import asyncio
import pandas as pd
from phoenix.evals import create_evaluator, async_evaluate_dataframe

@create_evaluator(name="text_analysis")
def text_analysis(text: str) -> dict:
    return {
        "score": len(text.split()),
        "label": "long" if len(text) > 50 else "short"
    }

df = pd.DataFrame({
    "text": [
        "Short text",
        "This is a much longer text that contains many more words and characters",
        "Medium length text here"
    ]
})

async def main():
    results_df = await async_evaluate_dataframe(
        dataframe=df,
        evaluators=[text_analysis],
        concurrency=5,  # Process up to 5 rows concurrently
        hide_tqdm_bar=True,
    )
    return results_df

results_df = asyncio.run(main())
print(results_df.columns)

With LLM evaluators:

from phoenix.evals import create_classifier, LLM

llm = LLM(provider="openai", model="gpt-4")

sentiment_evaluator = create_classifier(
    name="sentiment",
    prompt_template="Classify sentiment: {text}",
    llm=llm,
    choices=["positive", "negative", "neutral"]
)

df = pd.DataFrame({
    "text": [
        "I love this product!",
        "This is terrible quality",
        "It's okay, nothing special"
    ]
})

async def evaluate_sentiment():
    results_df = await async_evaluate_dataframe(
        dataframe=df,
        evaluators=[sentiment_evaluator],
        concurrency=2,  # Limit concurrent LLM calls
        tqdm_bar_format="{l_bar}{bar}| {n_fmt}/{total_fmt}"
    )
    return results_df

results_df = asyncio.run(evaluate_sentiment())

Error handling and retries:

async def robust_evaluation():
    results_df = await async_evaluate_dataframe(
        dataframe=df,
        evaluators=[sentiment_evaluator],
        concurrency=3,
        exit_on_error=False,  # Continue despite errors
        max_retries=5,        # Retry failed evaluations
        tqdm_bar_format="{l_bar}{bar}| {n_fmt}/{total_fmt} [{elapsed}<{remaining}]"
    )

    # Check for failures
    import json
    failed_rows = []
    for idx, row in results_df.iterrows():
        details = json.loads(row['sentiment_execution_details'])
        if details['status'] != 'success':
            failed_rows.append(idx)

    print(f"Failed evaluations: {len(failed_rows)} out of {len(results_df)}")
    return results_df

results_df = asyncio.run(robust_evaluation())

Notes

  • Score name collisions: If multiple evaluators return scores with the same name, they will write to the same column (e.g., ‘same_name_score’). This can lead to data loss as later scores overwrite earlier ones.

  • Similarly, evaluator names should be unique to ensure execution_details columns don’t collide.

  • Failed evaluations: If an evaluation fails, the failure details will be recorded in the execution_details column and the score will be None.
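
Since score columns hold JSON-serialized Score objects and failed evaluations leave None, downstream code usually needs to decode each cell defensively. The sketch below assumes each cell is either None or a JSON string whose keys mirror the Score fields (such as "score" and "label"); parse_score_cell is a hypothetical helper.

```python
import json
from typing import Any, Dict, Optional


def parse_score_cell(cell: Optional[str]) -> Dict[str, Any]:
    """Decode one "{score.name}_score" cell; failed evaluations yield an empty dict."""
    if cell is None:
        return {}
    return json.loads(cell)
```

A caller could then read parse_score_cell(cell).get("label") without special-casing failed rows.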

Score#

Score#

class Score(*, name=None, score=None, label=None, explanation=None, metadata=None, direction='maximize', kind=None)#

Bases: object

Represents the result of an evaluation.

A Score contains the evaluation result along with metadata about the evaluation. It can represent numeric scores, categorical labels, explanations, or combinations thereof.

Examples

Creating different types of scores:

from phoenix.evals.evaluators import Score

# Numeric score only
numeric_score = Score(
    name="accuracy",
    score=0.85,
    kind="llm",
    direction="maximize"
)

# Label only (categorical)
label_score = Score(
    name="sentiment",
    label="positive",
    kind="llm",
    direction="maximize"
)

# Score with explanation
detailed_score = Score(
    name="relevance",
    score=0.9,
    label="highly_relevant",
    explanation="The answer directly addresses all aspects of the question",
    metadata={"model": "gpt-4", "confidence": 0.95},
    kind="llm",
    direction="maximize"
)

# Boolean evaluation
boolean_score = Score(
    name="has_citation",
    score=1.0,
    label="true",
    explanation="Found 3 citations in the text",
    kind="code",
    direction="maximize"
)

pretty_print(indent=2)#

Pretty print the Score as formatted JSON.

Parameters:

indent – Number of spaces for indentation. Defaults to 2.

property source#

The source of this score (deprecated).

to_dict()#

Convert the Score to a dictionary, excluding None values.

Returns:

A dictionary representation of the Score with None values excluded.
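
The None-exclusion behavior of to_dict can be pictured with a small stand-in class. ScoreSketch below is hypothetical: it only copies the field names from the Score signature shown earlier and is not the library's class.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, Optional


@dataclass
class ScoreSketch:
    """Hypothetical stand-in for Score with the same field names."""

    name: Optional[str] = None
    score: Optional[float] = None
    label: Optional[str] = None
    explanation: Optional[str] = None
    metadata: Optional[Dict[str, Any]] = None
    direction: str = "maximize"
    kind: Optional[str] = None

    def to_dict(self) -> Dict[str, Any]:
        # Drop every field whose value is None, keep everything else
        return {k: v for k, v in asdict(self).items() if v is not None}
```

For a label-only score, the resulting dictionary contains no "score" or "explanation" keys at all, rather than keys set to None.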

Built-in Metrics#

class ConcisenessEvaluator(llm, **kwargs)#

Bases: ClassificationEvaluator

An evaluator for assessing whether model outputs are concise and free of unnecessary content.

Parameters:
  • llm (LLM) – The LLM instance to use for the evaluation.

  • **kwargs – Additional invocation parameters forwarded to the LLM client (e.g., temperature=0.0, max_tokens=256).

Notes

  • Evaluates whether the output to an input is concise or verbose.

  • Returns one Score with label (concise or verbose), score (1.0 if concise, 0.0 if verbose), and an explanation from the LLM judge.

  • Requires an LLM that supports tool calling or structured output.

Examples:

from phoenix.evals.metrics.conciseness import ConcisenessEvaluator
from phoenix.evals import LLM
llm = LLM(provider="openai", model="gpt-4o-mini")

# Default usage
conciseness_eval = ConcisenessEvaluator(llm=llm)

# With custom invocation parameters
conciseness_eval = ConcisenessEvaluator(llm=llm, temperature=0.0)

eval_input = {
    "input": "What is the capital of France?",
    "output": "Paris.",
    }
scores = conciseness_eval.evaluate(eval_input)
print(scores)
# [Score(name='conciseness', score=1.0, label='concise',
#     explanation='The response directly answers the question with no extra words.',
#     metadata={'model': 'gpt-4o-mini'},
#     kind="llm", direction="maximize")]
CHOICES = {'concise': 1.0, 'verbose': 0.0}#
class ConcisenessInputSchema(*, input, output)#

Bases: BaseModel

input#
model_config = {}#

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

output#
DIRECTION = 'maximize'#
NAME = 'conciseness'#
PROMPT = <phoenix.evals.llm.prompts.PromptTemplate object>#
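The CHOICES and DIRECTION attributes suggest how a judge's label becomes a numeric score. The sketch below is an assumption about that lookup, not the library's code; label_to_score is a hypothetical helper.

```python
from typing import Dict, Optional

# Values copied from the ConcisenessEvaluator attributes documented above.
CHOICES: Dict[str, float] = {"concise": 1.0, "verbose": 0.0}
DIRECTION = "maximize"  # higher scores are better for this metric


def label_to_score(label: str, choices: Dict[str, float]) -> Optional[float]:
    """Hypothetical helper: return the score for a label, or None if unknown."""
    return choices.get(label)


concise_score = label_to_score("concise", CHOICES)
verbose_score = label_to_score("verbose", CHOICES)
unknown_score = label_to_score("rambling", CHOICES)  # not a known label
```

The same pattern applies to the other classification evaluators in this section, each with its own CHOICES mapping.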
class CorrectnessEvaluator(llm, **kwargs)#

Bases: ClassificationEvaluator

An evaluator for assessing factual accuracy and completeness of model outputs.

Parameters:
  • llm (LLM) – The LLM instance to use for the evaluation.

  • **kwargs – Additional invocation parameters forwarded to the LLM client (e.g., temperature=0.0, max_tokens=256).

Notes

  • Evaluates whether the output for a given input is correct or incorrect.

  • Returns one Score with label (correct or incorrect), score (1.0 if correct, 0.0 if incorrect), and an explanation from the LLM judge.

  • Requires an LLM that supports tool calling or structured output.

Examples:

from phoenix.evals.metrics.correctness import CorrectnessEvaluator
from phoenix.evals import LLM
llm = LLM(provider="openai", model="gpt-4o-mini")

# Default usage
correctness_eval = CorrectnessEvaluator(llm=llm)

# With custom invocation parameters
correctness_eval = CorrectnessEvaluator(llm=llm, temperature=0.0)

eval_input = {
    "input": "What is the capital of France?",
    "output": "Paris is the capital of France.",
    }
scores = correctness_eval.evaluate(eval_input)
print(scores)
# [Score(name='correctness', score=1.0, label='correct',
#     explanation='The response accurately states that Paris is the capital of France.',
#     metadata={'model': 'gpt-4o-mini'},
#     kind="llm", direction="maximize")]
CHOICES = {'correct': 1.0, 'incorrect': 0.0}#
class CorrectnessInputSchema(*, input, output)#

Bases: BaseModel

input#
model_config = {}#

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

output#
DIRECTION = 'maximize'#
NAME = 'correctness'#
PROMPT = <phoenix.evals.llm.prompts.PromptTemplate object>#
class DocumentRelevanceEvaluator(llm, **kwargs)#

Bases: ClassificationEvaluator

A specialized evaluator for determining document relevance to a given question.

Parameters:
  • llm (LLM) – The LLM instance to use for the evaluation.

  • **kwargs – Additional invocation parameters forwarded to the LLM client (e.g., temperature=0.0, max_tokens=256).

Notes

  • Evaluates whether a document contains information relevant to answering a specific question.

  • Returns one Score with label (relevant or unrelated), score (1.0 if relevant, 0.0 if unrelated), and an explanation from the LLM judge.

  • Requires an LLM that supports tool calling or structured output.

Examples:

from phoenix.evals.metrics.document_relevance import DocumentRelevanceEvaluator
from phoenix.evals import LLM
llm = LLM(provider="openai", model="gpt-4o-mini")

# Default usage
relevance_eval = DocumentRelevanceEvaluator(llm=llm)

# With custom invocation parameters
relevance_eval = DocumentRelevanceEvaluator(llm=llm, temperature=0.0)

eval_input = {
    "input": "What is the capital of France?",
    "document_text": "Paris is the capital and largest city of France"
    }
scores = relevance_eval.evaluate(eval_input)
print(scores)
CHOICES = {'relevant': 1.0, 'unrelated': 0.0}#
DIRECTION = 'maximize'#
class DocumentRelevanceInputSchema(*, input, document_text)#

Bases: BaseModel

document_text#
input#
model_config = {}#

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

NAME = 'document_relevance'#
PROMPT = <phoenix.evals.llm.prompts.PromptTemplate object>#
class FaithfulnessEvaluator(llm, **kwargs)#

Bases: ClassificationEvaluator

A specialized evaluator for detecting faithfulness in grounded LLM responses.

Parameters:
  • llm (LLM) – The LLM instance to use for the evaluation.

  • **kwargs – Additional invocation parameters forwarded to the LLM client (e.g., temperature=0.0, max_tokens=256).

Notes

  • Evaluates whether the output for a given input is faithful or unfaithful to the provided context.

  • Returns one Score with label (faithful or unfaithful), score (1.0 if faithful, 0.0 if unfaithful), and an explanation from the LLM judge.

  • Requires an LLM that supports tool calling or structured output.

Examples:

from phoenix.evals.metrics.faithfulness import FaithfulnessEvaluator
from phoenix.evals import LLM
llm = LLM(provider="openai", model="gpt-4o-mini")

# Default usage
faithfulness_eval = FaithfulnessEvaluator(llm=llm)

# With custom invocation parameters
faithfulness_eval = FaithfulnessEvaluator(llm=llm, temperature=0.0)

eval_input = {
    "input": "What is the capital of France?",
    "output": "Paris is the capital of France.",
    "context": "Paris is the capital and largest city of France."
    }
scores = faithfulness_eval.evaluate(eval_input)
print(scores)
# [Score(name='faithfulness', score=1.0, label='faithful',
#     explanation='Information is supported by context', metadata={'model': 'gpt-4o-mini'},
#     kind="llm", direction="maximize")]
CHOICES = {'faithful': 1.0, 'unfaithful': 0.0}#
DIRECTION = 'maximize'#
class FaithfulnessInputSchema(*, input, output, context)#

Bases: BaseModel

context#
input#
model_config = {}#

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

output#
NAME = 'faithfulness'#
PROMPT = <phoenix.evals.llm.prompts.PromptTemplate object>#
class HallucinationEvaluator(llm, **kwargs)#

Bases: ClassificationEvaluator

A specialized evaluator for detecting hallucinations in grounded LLM responses.

Deprecated: HallucinationEvaluator is deprecated. Please use FaithfulnessEvaluator instead. The new evaluator uses 'faithful'/'unfaithful' labels and maximizes the score (1.0 = faithful).

Parameters:
  • llm (LLM) – The LLM instance to use for the evaluation.

  • **kwargs – Additional invocation parameters forwarded to the LLM client (e.g., temperature=0.0, max_tokens=256).

Notes

  • Evaluates whether the output for a given input is factual or hallucinated, judged against the provided context.

  • Returns one Score with label (factual or hallucinated), score (1.0 if hallucinated, 0.0 if factual), and an explanation from the LLM judge.

  • Requires an LLM that supports tool calling or structured output.

Examples:

from phoenix.evals.metrics.hallucination import HallucinationEvaluator
from phoenix.evals import LLM
llm = LLM(provider="openai", model="gpt-4o-mini")

# Default usage
hallucination_eval = HallucinationEvaluator(llm=llm)

# With custom invocation parameters
hallucination_eval = HallucinationEvaluator(llm=llm, temperature=0.0)

eval_input = {
    "input": "What is the capital of France?",
    "output": "Paris is the capital of France.",
    "context": "Paris is the capital and largest city of France."
    }
scores = hallucination_eval.evaluate(eval_input)
print(scores)
# [Score(name='hallucination', score=0.0, label='factual',
#     explanation='Information is supported by context', metadata={'model': 'gpt-4o-mini'},
#     kind="llm", direction="minimize")]
CHOICES = {'factual': 0.0, 'hallucinated': 1.0}#
DIRECTION = 'minimize'#
class HallucinationInputSchema(*, input, output, context)#

Bases: BaseModel

context#
input#
model_config = {}#

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

output#
NAME = 'hallucination'#
PROMPT = <phoenix.evals.llm.prompts.PromptTemplate object>#
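When comparing stored results from the deprecated HallucinationEvaluator with FaithfulnessEvaluator results, the two scales point in opposite directions. The sketch below assumes that 'factual' corresponds to 'faithful' and 'hallucinated' to 'unfaithful'. That correspondence and the helper name to_faithfulness are assumptions, and because the two evaluators use different prompts, converted results are only roughly comparable.

```python
from typing import Dict, Tuple

# Values copied from the CHOICES attributes documented in this section.
HALLUCINATION_CHOICES: Dict[str, float] = {"factual": 0.0, "hallucinated": 1.0}
FAITHFULNESS_CHOICES: Dict[str, float] = {"faithful": 1.0, "unfaithful": 0.0}

# Assumed label correspondence between the two evaluators.
LABEL_MAP: Dict[str, str] = {"factual": "faithful", "hallucinated": "unfaithful"}


def to_faithfulness(label: str) -> Tuple[str, float]:
    """Hypothetical helper: map a hallucination label to a faithfulness pair."""
    new_label = LABEL_MAP[label]
    return new_label, FAITHFULNESS_CHOICES[new_label]


converted = [to_faithfulness(label) for label in ("factual", "hallucinated")]
```

Under this mapping the converted score is always 1.0 minus the original hallucination score.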
class MatchesRegex(pattern, name=None, include_explanation=True)#

Bases: Evaluator

Evaluates whether text output matches a specified regular expression pattern.

This code evaluator checks if the output contains one or more substrings that match a given regex pattern. It returns a binary score (1.0 for match, 0.0 for no match) along with an explanation of which substrings matched or that no match was found.

Parameters:
  • pattern – The regular expression pattern to match against. Can be provided as a string or a compiled Pattern object.

  • name – Optional custom name for the evaluator. If not provided, defaults to “matches_regex”.

  • include_explanation – Whether to include an explanation in the Score object. Defaults to True.

Examples

Basic usage with URL detection:

import re
from phoenix.evals.metrics.matches_regex import MatchesRegex

# Compiled regex pattern
pattern = re.compile(r"https?://[^\s]+")
contains_link = MatchesRegex(pattern=pattern)

eval_input = {"output": "Check out https://github.com/Arize-ai/phoenix!"}

scores = contains_link.evaluate(eval_input)
print(scores)
# [Score(name='matches_regex',
#        score=1.0,
#        label=None,
#        explanation='There are 1 matches for the regex: https?://[^\s]+',
#        metadata={},
#        kind='code',
#        direction='maximize')]
class InputSchema(*, output)#

Bases: BaseModel

model_config = {}#

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

output#
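The matching behaviour described for MatchesRegex can be approximated with the standard library alone. The sketch below is a hypothetical re-implementation for illustration (matches_regex_sketch is not part of phoenix.evals, and the explanation wording is invented); it accepts either a string or a compiled pattern and returns a binary score.

```python
import re
from typing import Any, Dict, Union


def matches_regex_sketch(output: str, pattern: Union[str, re.Pattern]) -> Dict[str, Any]:
    """Hypothetical stand-in for the matching logic described above."""
    compiled = re.compile(pattern) if isinstance(pattern, str) else pattern
    # Collect the full text of every non-overlapping match.
    matches = [m.group(0) for m in compiled.finditer(output)]
    if matches:
        explanation = f"Matched {len(matches)} substring(s): {matches}"
    else:
        explanation = "No substring matched the pattern."
    return {"score": 1.0 if matches else 0.0, "explanation": explanation}


url_pattern = re.compile(r"https?://[^\s]+")
hit = matches_regex_sketch("See https://example.com now", url_pattern)
miss = matches_regex_sketch("No links here", r"https?://[^\s]+")
```

Because the check uses a search rather than a full match, the pattern only needs to occur somewhere in the output.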
class PrecisionRecallFScore(*, beta=1.0, average='macro', zero_division=0.0, positive_label=None)#

Bases: Evaluator

Code evaluator that computes precision, recall, and F-beta score given lists of expected and output labels.

Parameters:
  • beta – Weight of recall relative to precision. Must be > 0. Defaults to 1.0 (F1).

  • average – Aggregation strategy across classes. One of {'macro', 'micro', 'weighted'}. Defaults to 'macro'. Suffixes are only appended to metric names when a non-default average is used.

  • positive_label – When set, compute binary precision/recall/F exclusively for this label (one-vs-rest). If None and the labels are numeric with unique set {0, 1}, the positive label defaults to 1. Otherwise, multi-class averaging is used.

  • zero_division – Value to use when a metric is undefined (e.g., 0/0). Defaults to 0.0.

  • eval_input (Mapping[str, Any]) – Two lists of hashable labels:
    • expected (Sequence[Hashable]): Expected/true sequence of labels.

    • output (Sequence[Hashable]): Output/predicted sequence of labels.

Returns:

A list of Score objects.

Return type:

List[Score]

Raises:

ValueError – If input validation fails.

Notes

  • Supports labels as strings or integers (must be hashable)

  • Supports both binary and multi-class classification via averaging strategies

  • Score Naming:
    • Defaults (beta=1.0, average="macro"): names are precision, recall, and f1.

    • Non-default average: e.g., precision_micro, recall_weighted, f0_5_micro.
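The naming rules above can be written down as a small function. This is a sketch inferred from the documented examples only; score_names_sketch is a hypothetical helper, and the library may format names differently for beta values not shown here.

```python
from typing import List


def score_names_sketch(beta: float = 1.0, average: str = "macro") -> List[str]:
    """Hypothetical helper reproducing the naming rules listed above."""
    # Format beta compactly: 1.0 -> "1", 0.5 -> "0_5", 2.0 -> "2".
    beta_tag = f"{beta:g}".replace(".", "_")
    # A suffix is appended only for a non-default averaging strategy.
    suffix = "" if average == "macro" else f"_{average}"
    return [f"precision{suffix}", f"recall{suffix}", f"f{beta_tag}{suffix}"]


default_names = score_names_sketch()
micro_names = score_names_sketch(beta=0.5, average="micro")
```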

Examples

  1. Multi-class (macro):

    evaluator = PrecisionRecallFScore(beta=1.0, average="macro")
    eval_input = {"expected": ["cat", "dog", "cat", "bird"],
                  "output": ["cat", "cat", "cat", "bird"]}
    scores = evaluator(eval_input)
    [s.name for s in scores]
    # ['precision', 'recall', 'f1']
    
  2. Binary with explicit positive label:

    evaluator = PrecisionRecallFScore(beta=0.5, positive_label="spam")
    eval_input = {"expected": ["spam", "ham", "spam"],
                  "output": ["spam", "spam", "ham"]}
    scores = evaluator(eval_input)
    [s.name for s in scores]
    # ['precision', 'recall', 'f0_5']
    
class InputSchema(*, expected, output)#

Bases: BaseModel

expected#
model_config = {}#

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

output#
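To make the macro-averaged numbers behind PrecisionRecallFScore concrete, the sketch below computes them by hand for the multi-class example. It is a hypothetical stand-in, not the library code: macro_prf_sketch is an invented name, and it assumes the common convention (as used by scikit-learn) of computing F-beta per class and then averaging.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def macro_prf_sketch(
    expected: Sequence[Hashable],
    output: Sequence[Hashable],
    beta: float = 1.0,
    zero_division: float = 0.0,
) -> Dict[str, float]:
    """Hypothetical macro-averaged precision, recall, and F-beta."""
    if not expected or len(expected) != len(output):
        raise ValueError("expected and output must be non-empty and equal in length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(expected, output):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # pred was emitted but is wrong
            fn[truth] += 1  # truth was missed

    def ratio(num: float, den: float) -> float:
        # Fall back to zero_division when the metric is undefined (0/0).
        return num / den if den else zero_division

    labels = set(expected) | set(output)
    b2 = beta * beta
    precisions, recalls, fscores = [], [], []
    for label in labels:
        p = ratio(tp[label], tp[label] + fp[label])
        r = ratio(tp[label], tp[label] + fn[label])
        precisions.append(p)
        recalls.append(r)
        fscores.append(ratio((1 + b2) * p * r, b2 * p + r))
    n = len(labels)
    return {
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f": sum(fscores) / n,
    }


result = macro_prf_sketch(
    ["cat", "dog", "cat", "bird"], ["cat", "cat", "cat", "bird"]
)
```

For this input the per-class precisions are 2/3 (cat), 0.0 (dog, undefined and replaced by zero_division), and 1.0 (bird), so the macro precision is 5/9; the macro recall is 2/3 and the macro F1 is 0.6.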
class RefusalEvaluator(llm, **kwargs)#

Bases: ClassificationEvaluator

An evaluator for detecting when an LLM refuses or declines to answer a query.

Parameters:
  • llm (LLM) – The LLM instance to use for the evaluation.

  • **kwargs – Additional invocation parameters forwarded to the LLM client (e.g., temperature=0.0, max_tokens=256).

Notes

  • Detects refusals, deflections, scope disclaimers, and non-answers.

  • Returns one Score with label (refused or answered), score (1.0 if refused, 0.0 if answered), and an explanation from the LLM judge.

  • This metric is use-case agnostic: it only detects whether a refusal occurred, not whether the refusal was appropriate.

  • Requires an LLM that supports tool calling or structured output.

Examples:

from phoenix.evals.metrics.refusal import RefusalEvaluator
from phoenix.evals import LLM
llm = LLM(provider="openai", model="gpt-4o-mini")

# Default usage
refusal_eval = RefusalEvaluator(llm=llm)

# With custom invocation parameters
refusal_eval = RefusalEvaluator(llm=llm, temperature=0.0)

eval_input = {
    "input": "What is the capital of France?",
    "output": "I'm sorry, I can only help with technical questions.",
    }
scores = refusal_eval.evaluate(eval_input)
print(scores)
# [Score(name='refusal', score=1.0, label='refused',
#     explanation='The response refuses to answer by claiming scope limitations.',
#     metadata={'model': 'gpt-4o-mini'},
#     kind="llm", direction="neutral")]
CHOICES = {'answered': 0.0, 'refused': 1.0}#
DIRECTION = 'neutral'#
NAME = 'refusal'#
PROMPT = <phoenix.evals.llm.prompts.PromptTemplate object>#
class RefusalInputSchema(*, input, output)#

Bases: BaseModel

input#
model_config = {}#

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

output#
class ToolInvocationEvaluator(llm, **kwargs)#

Bases: ClassificationEvaluator

Determines if a tool was invoked correctly with proper arguments, formatting, and safe content.

Parameters:
  • llm (LLM) – The LLM instance to use for the evaluation.

  • **kwargs – Additional invocation parameters forwarded to the LLM client (e.g., temperature=0.0, max_tokens=256).

Notes

  • Evaluates whether an AI agent’s tool invocation was correct or incorrect based on the conversation context, available tool schemas, and the agent’s tool invocation(s).

  • This metric evaluates the correctness of the tool invocation (arguments, formatting, safety), not the correctness of the tool selection itself.

  • Returns one Score with label (correct or incorrect), score (1.0 if correct, 0.0 if incorrect), and an explanation from the LLM judge.

  • Requires an LLM that supports tool calling or structured output.

Criteria for Correct Invocation:
  • JSON is properly structured (if applicable).

  • All required fields/parameters are present.

  • No hallucinated or nonexistent fields (all fields exist in the tool schema).

  • Argument values match the user query and schema expectations.

  • No unsafe content (e.g., PII) in arguments.

Criteria for Incorrect Invocation:
  • Hallucinated or nonexistent fields not in the schema.

  • Missing required fields/parameters.

  • Improperly formatted or malformed JSON.

  • Incorrect, hallucinated, or mismatched argument values.

  • Unsafe content (e.g., PII, sensitive data) in arguments.
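Two of the criteria above (all required fields present, no fields outside the schema) can also be checked deterministically, for example as a cheap pre-check alongside the LLM judge. The sketch below is not part of ToolInvocationEvaluator; check_arguments_sketch is a hypothetical helper, and it assumes the tool schema and the call arguments are already available as dictionaries.

```python
import json
from typing import Any, Dict, List


def check_arguments_sketch(
    tool_schema: Dict[str, Any], arguments: Dict[str, Any]
) -> List[str]:
    """Hypothetical pre-check: list structural problems in a tool call."""
    params = tool_schema.get("parameters", {})
    known = set(params.get("properties", {}))
    required = set(params.get("required", []))
    problems = []
    for name in sorted(required - set(arguments)):
        problems.append(f"missing required field: {name}")
    for name in sorted(set(arguments) - known):
        problems.append(f"field not in schema: {name}")
    return problems


schema = json.loads(
    """
    {
        "name": "book_flight",
        "parameters": {
            "type": "object",
            "properties": {
                "origin": {"type": "string"},
                "destination": {"type": "string"},
                "date": {"type": "string"}
            },
            "required": ["origin", "destination", "date"]
        }
    }
    """
)

ok = check_arguments_sketch(
    schema, {"origin": "NYC", "destination": "LA", "date": "2024-01-15"}
)
bad = check_arguments_sketch(schema, {"origin": "NYC", "seat": "12A"})
```

The remaining criteria (argument values matching the user query, unsafe content) depend on meaning rather than structure, which is why the evaluator relies on an LLM judge.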

Examples:

from phoenix.evals.metrics.tool_invocation import ToolInvocationEvaluator
from phoenix.evals import LLM

llm = LLM(provider="openai", model="gpt-4o-mini")

# Default usage
tool_invocation_eval = ToolInvocationEvaluator(llm=llm)

# With custom invocation parameters
tool_invocation_eval = ToolInvocationEvaluator(llm=llm, temperature=0.0)

# Example with JSON schema format for available tools
eval_input = {
    "input": "User: Book a flight from NYC to LA for tomorrow",
    "available_tools": '''
    {
        "name": "book_flight",
        "description": "Book a flight between two cities",
        "parameters": {
            "type": "object",
            "properties": {
                "origin": {"type": "string", "description": "Departure city code"},
                "destination": {"type": "string", "description": "Arrival city code"},
                "date": {"type": "string", "description": "Flight date in YYYY-MM-DD"}
            },
            "required": ["origin", "destination", "date"]
        }
    }
    ''',
    "tool_selection": '''
    book_flight(origin="NYC", destination="LA", date="2024-01-15")
    '''
}
scores = tool_invocation_eval.evaluate(eval_input)
print(scores)

# Example with human-readable format for available tools
eval_input_readable = {
    "input": "User: What's the weather in San Francisco?",
    "available_tools": '''
    WeatherTool:
      Description: Get the current weather for a location
      Parameters:
        - location (required): The city name or coordinates
        - units (optional): Temperature units (celsius or fahrenheit)
    ''',
    "tool_selection": "WeatherTool(location='San Francisco', units='fahrenheit')"
}
scores = tool_invocation_eval.evaluate(eval_input_readable)
print(scores)
CHOICES = {'correct': 1.0, 'incorrect': 0.0}#
DIRECTION = 'maximize'#
NAME = 'tool_invocation'#
PROMPT = <phoenix.evals.llm.prompts.PromptTemplate object>#
class ToolInvocationInputSchema(*, input, available_tools, tool_selection)#

Bases: BaseModel

available_tools#
input#
model_config = {}#

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

tool_selection#
class ToolResponseHandlingEvaluator(llm, **kwargs)#

Bases: ClassificationEvaluator

Determines if an AI agent properly handled a tool’s response, including error handling, data extraction, transformation, and safe information disclosure.

Parameters:
  • llm (LLM) – The LLM instance to use for the evaluation.

  • **kwargs – Additional invocation parameters forwarded to the LLM client (e.g., temperature=0.0, max_tokens=256).

Notes

  • Evaluates whether an AI agent correctly processed the tool result to produce an appropriate output.

  • This metric evaluates what happens AFTER the tool returns, NOT whether the right tool was selected (tool_selection) or invoked correctly (tool_invocation).

  • Returns one Score with label (correct or incorrect), score (1.0 if correct, 0.0 if incorrect), and an explanation from the LLM judge.

  • Requires an LLM that supports tool calling or structured output.

Examples:

from phoenix.evals.metrics.tool_response_handling import ToolResponseHandlingEvaluator
from phoenix.evals import LLM

llm = LLM(provider="openai", model="gpt-4o-mini")

# Default usage
tool_response_eval = ToolResponseHandlingEvaluator(llm=llm)

# With custom invocation parameters
tool_response_eval = ToolResponseHandlingEvaluator(llm=llm, temperature=0.0)

# Example: Correct extraction from tool result
eval_input = {
    "input": "What's the weather in Seattle?",
    "tool_call": 'get_weather(location="Seattle")',
    "tool_result": '{"temperature": 58, "conditions": "cloudy"}',
    "output": "Seattle is currently 58°F and cloudy."
}
scores = tool_response_eval.evaluate(eval_input)
print(scores)

# Example: Hallucinated data (incorrect)
eval_input_hallucinated = {
    "input": "What restaurants are nearby?",
    "tool_call": 'search_restaurants(location="downtown")',
    "tool_result": '{"results": [{"name": "Cafe Luna", "rating": 4.2}]}',
    "output": "I found Cafe Luna and Mario's Italian nearby."
}
scores = tool_response_eval.evaluate(eval_input_hallucinated)
print(scores)  # Should be incorrect - Mario's Italian was hallucinated

# Example: Error handling with retry
eval_input_retry = {
    "input": "Find my recent orders",
    "tool_call": "get_orders(user_id='123')",
    "tool_result": '{"error": "rate_limit_exceeded", "retry_after": 30}',
    "output": "[Retried] Your order (ORD-001) has shipped."
}
scores = tool_response_eval.evaluate(eval_input_retry)
print(scores)
CHOICES = {'correct': 1.0, 'incorrect': 0.0}#
DIRECTION = 'maximize'#
NAME = 'tool_response_handling'#
PROMPT = <phoenix.evals.llm.prompts.PromptTemplate object>#
class ToolResponseHandlingInputSchema(*, input, tool_call, tool_result, output)#

Bases: BaseModel

input#
model_config = {}#

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

output#
tool_call#
tool_result#
class ToolSelectionEvaluator(llm, **kwargs)#

Bases: ClassificationEvaluator

A specialized evaluator for determining if the correct tool was selected for a given context.

Parameters:
  • llm (LLM) – The LLM instance to use for the evaluation.

  • **kwargs – Additional invocation parameters forwarded to the LLM client (e.g., temperature=0.0, max_tokens=256).

Notes

  • Evaluates whether an AI agent’s tool selection was correct or incorrect based on the conversation context, available tools, and the agent’s tool invocations.

  • The agent’s tool selection can be a single tool or a list of tools.

  • This metric evaluates the correctness of the tool selection, not the correctness of the tool invocations or the tool outputs.

  • Returns one Score with label (correct or incorrect), score (1.0 if correct, 0.0 if incorrect), and an explanation from the LLM judge.

  • Requires an LLM that supports tool calling or structured output.

Examples:

from phoenix.evals.metrics.tool_selection import ToolSelectionEvaluator
from phoenix.evals import LLM
llm = LLM(provider="openai", model="gpt-4o-mini")

# Default usage
tool_selection_eval = ToolSelectionEvaluator(llm=llm)

# With custom invocation parameters
tool_selection_eval = ToolSelectionEvaluator(llm=llm, temperature=0.0)

eval_input = {
    "input": "User: What is the weather in San Francisco?",
    "available_tools": (
        "WeatherTool: Get the current weather for a location.\n"
        "NewsTool: Stay connected to global events with our up-to-date news.\n"
        "MusicTool: Create playlists, search for music, and check music trends."
    ),
    "tool_selection": "WeatherTool(location='San Francisco')" # input args optional
}
scores = tool_selection_eval.evaluate(eval_input)
print(scores)
CHOICES = {'correct': 1.0, 'incorrect': 0.0}#
DIRECTION = 'maximize'#
NAME = 'tool_selection'#
PROMPT = <phoenix.evals.llm.prompts.PromptTemplate object>#
class ToolSelectionInputSchema(*, input, available_tools, tool_selection)#

Bases: BaseModel

available_tools#
input#
model_config = {}#

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

tool_selection#

Utilities#

default_tqdm_progress_bar_formatter(title)#

Returns a progress bar formatter for use with tqdm.

Parameters:

title (str) – The title of the progress bar, displayed as a prefix.

Returns:

A formatter to be passed to the bar_format argument of tqdm.

Return type:

str
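For orientation, a tqdm bar_format value is an ordinary string with placeholders that tqdm fills in while rendering. The sketch below builds one with a title prefix; progress_bar_format_sketch is a hypothetical function, and the exact layout returned by default_tqdm_progress_bar_formatter is not documented here and may differ.

```python
def progress_bar_format_sketch(title: str) -> str:
    """Hypothetical formatter: prefix a tqdm bar_format string with a title."""
    # {bar}, {n_fmt}, {total_fmt}, {percentage}, {elapsed}, and {remaining}
    # are placeholders that tqdm substitutes when it draws the bar.
    return (
        title
        + " |{bar}| {n_fmt}/{total_fmt} ({percentage:3.1f}%)"
        + " | {elapsed}<{remaining}"
    )


bar_format = progress_bar_format_sketch("Evaluating")
```

The resulting string would be passed as tqdm(iterable, bar_format=bar_format).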

extract_with_jsonpath(data, path, match_all=False)#

Extract a value from a nested JSON structure using jsonpath-ng.

Parameters:
  • data – The input dictionary to be extracted from.

  • path – The jsonpath to extract from the data.

  • match_all – If True, return a list of all matches. By default, return only the first match.

Returns:

The extracted value (can be None).

Raises:
  • JsonPathParserError – If the path is not parseable (invalid syntax).

  • ValueError – If the path is invalid or not found (missing key, index out of bounds, etc).

remap_eval_input(eval_input, required_fields, input_mapping=None)#

Remap eval_input keys based on required_fields and an optional input_mapping.

Parameters:
  • eval_input – The input dictionary to be remapped.

  • required_fields – The required field names as a set of strings.

  • input_mapping – Optional mapping from evaluator-required field -> eval_input key.

Returns:

A dictionary with keys as required_fields and values from eval_input.

Raises:

ValueError – If a required field is missing in eval_input or has a null/empty value.
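The remapping rule can be pictured with plain dictionary lookups. The sketch below is a simplified, hypothetical version (remap_sketch is not the library function): it supports only direct key names in input_mapping and treats None and the empty string as missing, which are assumptions about what "null/empty" covers.

```python
from typing import Any, Dict, Mapping, Optional, Set


def remap_sketch(
    eval_input: Mapping[str, Any],
    required_fields: Set[str],
    input_mapping: Optional[Mapping[str, str]] = None,
) -> Dict[str, Any]:
    """Hypothetical, simplified remapping using plain key lookups."""
    mapping = input_mapping or {}
    remapped: Dict[str, Any] = {}
    for field_name in sorted(required_fields):
        # Fall back to the field's own name when it has no mapping entry.
        source_key = mapping.get(field_name, field_name)
        value = eval_input.get(source_key)
        if value is None or value == "":
            raise ValueError(
                f"Missing or empty value for required field '{field_name}'"
            )
        remapped[field_name] = value
    return remapped


row = {"question": "What is the capital of France?", "answer": "Paris."}
remapped = remap_sketch(
    row, {"input", "output"}, {"input": "question", "output": "answer"}
)
```

Here the evaluator-facing names input and output are filled from the question and answer keys of the row; without the mapping, the same call would raise ValueError because neither input nor output exists in the row.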

to_annotation_dataframe(dataframe, score_names=None)#

Format scores as annotations for logging to Phoenix.

This function takes the output of evaluate_dataframe and an optional list of score names, and formats it for logging to Phoenix. If no score names are provided, the function extracts all scores from the dataframe (columns ending in _score). Score, label, explanation, and metadata are extracted from each score column and exploded into separate columns. Annotation name and kind are also added as columns.

Parameters:
  • dataframe (pd.DataFrame) – DataFrame returned by (async_)evaluate_dataframe

  • score_names (List[str]) – Names of the score columns to log (e.g., ["precision", "hallucination"]). If None, all columns ending with _score will be used.

Returns:

DataFrame with the score column, annotation name, and annotator kind columns for the specified score names.

Return type:

pd.DataFrame

Examples:

from phoenix.client import Client
from phoenix.evals import evaluate_dataframe
from phoenix.evals.utils import to_annotation_dataframe

client = Client()
results = evaluate_dataframe(df, evaluators)

# Log only hallucination annotations
hallucination_annotations = to_annotation_dataframe(results, ["hallucination"])
client.spans.log_span_annotations_dataframe(dataframe=hallucination_annotations)

# Log all scores as annotations
all_annotations = to_annotation_dataframe(results)
client.spans.log_span_annotations_dataframe(dataframe=all_annotations)