Evals#

LLM Interfaces #

LLM #

class LLM(*, provider=None, model=None, client=None, initial_per_second_request_rate=None, sync_client_kwargs=None, async_client_kwargs=None, **kwargs)#

Bases: object

An LLM wrapper that simplifies the API for generating text and objects.

This wrapper delegates API access to SDK/client libraries that are installed in the active Python environment. To show supported providers, use show_provider_availability().

The LLM class provides both synchronous and asynchronous methods for all operations.

Examples:

from phoenix.evals.llm import LLM, show_provider_availability
show_provider_availability()
llm = LLM(provider="openai", model="gpt-4o")
llm.generate_text(prompt="Hello, world!")
"Hello, world!"
llm.generate_object(
    prompt="Hello, world!",
    schema={
        "type": "object",
        "properties": {
            "text": {"type": "string"}
        },
        "required": ["text"]
    })
{"text": "Hello, world!"}

async async_generate_classification(prompt, labels, include_explanation=True, description=None, **kwargs)#

Asynchronously generate a classification given a prompt and a set of labels.

Parameters:

prompt (Union[str, List[Dict[str, Any]]]) – The prompt template to go with the tool call.
labels (Union[List[str], Dict[str, str]]) – Either: - A list of strings, where each string is a label - A dictionary where keys are labels and values are descriptions
include_explanation (bool) – Whether to prompt the LLM for an explanation.
description (Optional[str]) – A description of the classification task.
**kwargs – Additional keyword arguments to pass to the LLM SDK.

Returns:

The generated classification.

Return type:

Dict[str, Any]

async async_generate_object(prompt, schema, tracer=None, **kwargs)#

Asynchronously generate an object given a prompt and a schema.

Parameters:

prompt (Union[str, List[Dict[str, Any]]]) – The prompt to generate the object from.
schema (Dict[str, Any]) – A JSON schema that describes the generated object.
**kwargs – Additional keyword arguments to pass to the LLM SDK.

Returns:

The generated object.

Return type:

Dict[str, Any]

async async_generate_text(prompt, tracer=None, **kwargs)#

Asynchronously generate text given a prompt.

Parameters:

prompt (Union[str, List[Dict[str, Any]]]) – The prompt to generate text from.
tracer (Optional[Tracer]) – The tracer to use for tracing.
**kwargs – Additional keyword arguments to pass to the LLM SDK.

Returns:

The generated text.

Return type:

str

generate_classification(prompt, labels, include_explanation=True, description=None, **kwargs)#

Generate a classification given a prompt and a set of labels.

Parameters:

prompt (Union[str, List[Dict[str, Any]]]) – The prompt template to go with the tool call.
labels (Union[List[str], Dict[str, str]]) – Either: - A list of strings, where each string is a label - A dictionary where keys are labels and values are descriptions
include_explanation (bool) – Whether to prompt the LLM for an explanation.
description (Optional[str]) – A description of the classification task.
**kwargs – Additional keyword arguments to pass to the LLM SDK.

Returns:

The generated classification.

Return type:

Dict[str, Any]

Examples:

from phoenix.evals import LLM
llm = LLM(provider="openai", model="gpt-4o", client="openai")
llm.generate_classification(
    prompt="Hello, world!",
    labels=["yes", "no"],
)
{"label": "yes", "explanation": "The answer is yes."}
llm.generate_classification(
    prompt="Hello, world!",
    labels={"yes": "Positive response", "no": "Negative response"},
    include_explanation=False,
)
{"label": "yes"}

generate_object(prompt, schema, tracer=None, **kwargs)#

Generate an object given a prompt and a schema.

Parameters:

prompt (Union[str, List[Dict[str, Any]]]) – The prompt to generate the object from.
schema (Dict[str, Any]) – A JSON schema that describes the generated object.
tracer (Optional[Tracer]) – Optional tracer for tracing operations.
**kwargs – Additional keyword arguments to pass to the LLM SDK.

Returns:

The generated object.

Return type:

Dict[str, Any]

generate_text(prompt, tracer=None, **kwargs)#

Generate text given a prompt.

Parameters:

prompt (Union[str, List[Dict[str, Any]]]) – The prompt to generate text from.
tracer (Optional[Tracer]) – Optional tracer for tracing operations.
**kwargs – Additional keyword arguments to pass to the LLM SDK.

Returns:

The generated text.

Return type:

str

Prompt Template #

class Template(*, template, template_format=None)#

Bases: object

Template for rendering prompts with mustache ({{variable}}) or f-string ({variable}) formats.

Supports auto-detection of template format and handles JSON content correctly.

Deprecated since version Template: is deprecated. Use PromptTemplate instead, which supports both string templates and message lists (OpenAI-style format).

render(variables, tracer=None)#

Render the template with the given variables.

Parameters:

variables (Dict[str, Any]) – The variables to substitute into the template.
tracer (Optional[Tracer]) – Optional tracer for tracing operations.

Returns:

The rendered template.

Return type:

str

Raises:

TypeError – If variables is not a dictionary.

property variables#

Get the list of variables used in the template.

Returns:: A list of variable names found in the template.
Return type:: List[str]

Evaluator Abstractions #

Evaluator Base #

class Evaluator(*, name, kind, direction='maximize', input_schema=None)#

Bases: ABC

Core abstraction for evaluators.

Supports single-record synchronous (evaluate) and asynchronous (async_evaluate) modes with optional per-call field_mapping.

Note: Subclasses must implement either the _evaluate or _async_evaluate method. Implementing both methods is recommended.

Parameters:

name – The name of this evaluator, used for identification and Score naming.
kind – The kind of this evaluator (human, llm, or code).
input_schema – Optional Pydantic BaseModel for input typing and validation. If None, subclasses infer fields from prompts or function signatures and may construct a model dynamically.
direction – The direction for score optimization (“maximize” or “minimize”). Defaults to “maximize”.

async async_evaluate(eval_input, input_mapping=None)#

Async variant of evaluate. Validates and remaps input as described in evaluate.

Returns:: A list of Score objects.

bind(input_mapping)#: Binds an evaluator with a fixed input mapping.

describe()#: Return a JSON-serializable description of the evaluator, including its name, kind, direction, and input fields derived from the Pydantic input schema when available.

property direction#: The direction for score optimization.

evaluate(eval_input, input_mapping=None)#

Validate and remap eval_input using the evaluator’s input fields (from input_schema when available, otherwise from the provided input_mapping). An optional per-call input_mapping maps evaluator-required field names to keys/paths in eval_input.

Returns:: A list of Score objects.

property input_schema#: Read-only Pydantic input schema for this evaluator, if set.

property kind#: The kind of this evaluator.

property name#: The name of this evaluator.

property source#: The source of this evaluator (deprecated).

unbind()#: Unbinds an evaluator from an input mapping.

LLMEvaluator #

class LLMEvaluator(*, name, llm, prompt_template, schema=None, input_schema=None, direction='maximize', **kwargs)#

Bases: Evaluator

Base LLM evaluator that infers required input fields from its prompt template and constructs a default Pydantic input schema when none is supplied.

Note: Subclasses must implement either the _evaluate or _async_evaluate method. Implementing both methods is recommended.

Parameters:

name – Identifier for this evaluator and the name used in produced Scores.
llm – The LLM instance to use for evaluation.
prompt_template – The prompt template with placeholders for required fields; used to infer required variables. Can be either a string template or a list of message dictionaries (for chat-based models).
schema – Optional tool/JSON schema for structured output when supported by the LLM.
input_schema – Optional Pydantic model describing/validating inputs. If not provided, a model is dynamically created from the prompt variables (all str, required).
direction – The score optimization direction (“maximize” or “minimize”). Defaults to “maximize”.
**kwargs – Invocation parameters forwarded to the LLM client

async async_evaluate(eval_input, input_mapping=None)#

Async variant of evaluate. Validates and remaps input as described in evaluate.

Returns:: A list of Score objects.

evaluate(eval_input, input_mapping=None)#

Validate and remap eval_input using the evaluator’s input fields (from input_schema when available, otherwise from the provided input_mapping). An optional per-call input_mapping maps evaluator-required field names to keys/paths in eval_input.

Returns:: A list of Score objects.

property prompt_template#: Get the prompt template.

ClassificationEvaluator #

class ClassificationEvaluator(*, name, llm, prompt_template, choices, include_explanation=True, input_schema=None, direction='maximize', **kwargs)#

Bases: LLMEvaluator

LLM-based evaluator for classification-style judgements.

Supports label-only or label+score mappings, and returns explanations by default. Note: Requires the LLM to have tool calling or structured output capabilities.

Parameters:

name – Identifier for this evaluator and the name used in produced Scores.
llm – The LLM instance to use for evaluation. Must support tool calling or structured output for reliable classification.
prompt_template – The prompt template with placeholders for required input fields. Can be either a string template or a list of message dictionaries (for chat-based models). Template variables are inferred automatically.
choices –
Classification choices in one of three formats: a. List[str]: Simple list of label names (e.g., [“positive”, “negative”]).

Scores will be None.
1. Dict[str, Union[float, int]]: Labels mapped to numeric scores
  (e.g., {“positive”: 1.0, “negative”: 0.0}).
2. Dict[str, Tuple[Union[float, int], str]]: Labels mapped to tuples of
  (score, description) (e.g., {“positive”: (1.0, “Positive sentiment”), “negative”: (0.0, “Negative sentiment”)}). Not recommended as LLMs do not reliably follow this schema.
include_explanation – Whether to request explanations for classification decisions. Defaults to True in accordance with best practices.
input_schema – Optional Pydantic model for input validation. If not provided, a model is automatically created from prompt template variables.
direction – Score optimization direction (“maximize” or “minimize”). Defaults to “maximize”.
**kwargs – Invocation parameters forwarded to the LLM client

Returns:

A list containing a single Score object with the classification: result, including label, optional score, and optional explanation.

Return type:

List[Score]

Examples

Classification with labels only:

from phoenix.evals import ClassificationEvaluator, LLM

evaluator = ClassificationEvaluator(
    name="sentiment",
    llm=LLM(provider="openai", model="gpt-4"),
    prompt_template="Classify the sentiment of this text: {text}",
    choices=["positive", "negative", "neutral"]
)

result = evaluator.evaluate({"text": "I love this product!"})
print(result[0].label)  # "positive"
print(result[0].explanation)  # LLM's reasoning
print(result[0].score)  # None

Classification with scores:

# Map labels to numeric scores
evaluator = ClassificationEvaluator(
    name="quality",
    llm=llm,
    prompt_template="Rate the quality of this response: {response}",
    choices={
        "excellent": 5,
        "good": 4,
        "fair": 3,
        "poor": 2,
        "terrible": 1
    }
)

result = evaluator.evaluate({"response": "Great explanation with examples"})
print(result[0].label)  # "excellent"
print(result[0].score)  # 5

Classification with scores and descriptions (use with caution):

# Map labels to (score, description) tuples
evaluator = ClassificationEvaluator(
    name="relevance",
    llm=llm,
    prompt_template="How relevant is this answer to the question?\n"
                   "Question: {question}\nAnswer: {answer}",
    choices={
        "highly_relevant": (1.0, "Answer directly addresses the question"),
        "somewhat_relevant": (0.5, "Answer partially addresses the question"),
        "not_relevant": (0.0, "Answer does not address the question")
    }
)

result = evaluator.evaluate({
    "question": "What is the capital of France?",
    "answer": "Paris is the capital city of France."
})
print(result[0].label)  # "highly_relevant"
print(result[0].score)  # 1.0

Core Functions #

create_evaluator #

create_evaluator(name, source=None, direction='maximize', kind=None)#

Decorator that turns a simple function into an Evaluator instance.

The decorated function should accept keyword args matching its required fields and return a value that can be converted to a Score. The returned object is an Evaluator with full support for evaluate/async_evaluate and maintains direct callability.

Parameters:

name – Identifier for the evaluator and the name used in produced Scores.
kind – The kind of this evaluator (“human”, “llm”, or “code”). Defaults to “code”.
direction – The score optimization direction (“maximize” or “minimize”). Defaults to “maximize”.

Examples

Basic usage with numeric return:

from phoenix.evals import create_evaluator

@create_evaluator(name="precision")
def precision(retrieved_documents: list[int], relevant_documents: list[int]) -> float:
    # Calculate precision for information retrieval
    relevant_set = set(relevant_documents)
    hits = sum(1 for doc in retrieved_documents if doc in relevant_set)
    return hits / len(retrieved_documents) if retrieved_documents else 0.0

# Use the evaluator
result = precision.evaluate({
    "retrieved_documents": [1, 2, 3, 4],
    "relevant_documents": [2, 4, 6]
})
print(result[0].score)  # 0.5

# Direct callability maintained:
result = precision(retrieved_documents=[1, 2, 3, 4], relevant_documents=[2, 4, 6])
print(result)  # 0.5

Different return types:

# Boolean return (converted to score and label)
@create_evaluator(name="is_valid")
def is_valid(text: str) -> bool:
    return len(text.strip()) > 0

# Dictionary return with multiple fields
@create_evaluator(name="positive_sentiment")
def positive_sentiment(text: str) -> dict:
    # Simplified sentiment analysis
    positive_words = ["good", "great", "excellent"]
    score = sum(1 for word in positive_words if word in text.lower())
    return {
        "score": score / len(positive_words),
        "label": "positive" if score > 0 else "neutral",
        "explanation": f"Found {score} positive indicators"
    }

# Tuple return (score, label, explanation)
@create_evaluator(name="length_check")
def length_check(text: str) -> tuple:
    length = len(text)
    is_good = 10 <= length <= 100
    return (float(is_good), "good" if is_good else "bad", f"Length: {length}")

Using with dataframes:

import pandas as pd
from phoenix.evals import evaluate_dataframe

@create_evaluator(name="word_count")
def word_count(text: str) -> int:
    return len(text.split())

df = pd.DataFrame({
    "text": ["Hello world", "This is a longer sentence", "Short"]
})

results_df = evaluate_dataframe(dataframe=df, evaluators=[word_count])
print(results_df["word_count_score"])  # JSON scores for each row

Notes

The decorated function can return:

A Score object (no conversion needed)
A number (converted to Score.score)
A boolean (converted to integer Score.score and string Score.label)
A short string (≤3 words, converted to Score.label)
A long string (≥4 words, converted to Score.explanation)
A dictionary with keys “score”, “label”, or “explanation”
A tuple of values (only bool, number, str types allowed)

An input_schema is automatically created from the function signature, capturing the required input fields, their types, and any defaults. For best results, do not use *args or **kwargs.

The decorator automatically handles conversion to a valid Score object.

create_classifier #

create_classifier(name, prompt_template, llm, choices, direction='maximize')#

Factory to create a ClassificationEvaluator.

Note: The evaluator requires the LLM to have tool calling or structured output capabilities.

Parameters:

name – Identifier for this evaluator and the name used in produced Scores.
llm – The LLM instance to use for evaluation.
prompt_template – Prompt template string with placeholders for inputs.
choices – One of List[str], Dict[str, number], or Dict[str, Tuple[number, str]] describing classification labels (and optional scores/descriptions).
direction – The score optimization direction (“maximize” or “minimize”). Defaults to “maximize”.

Returns:

A ClassificationEvaluator instance.

Examples

Creating a simple sentiment classifier:

from phoenix.evals import create_classifier, LLM

llm = LLM(provider="openai", model="gpt-4")

sentiment_evaluator = create_classifier(
    name="sentiment",
    prompt_template="Analyze the sentiment: {text}",
    llm=llm,
    choices=["positive", "negative", "neutral"]
)

result = sentiment_evaluator.evaluate({"text": "Great product!"})
print(result[0].label)  # "positive"
print(result[0].score)  # None

Creating a classifier with numeric scores:

quality_evaluator = create_classifier(
    name="response_quality",
    prompt_template="Rate this response quality: {response}",
    llm=llm,
    choices={
        "excellent": 5,
        "good": 4,
        "average": 3,
        "poor": 2,
        "terrible": 1
    }
)

result = quality_evaluator.evaluate({"response": "Detailed and helpful answer"})
print(f"Quality: {result[0].label} (Score: {result[0].score})")

Creating a classifier with scores and descriptions:

accuracy_evaluator = create_classifier(
    name="factual_accuracy",
    prompt_template="Check factual accuracy: {claim}",
    llm=llm,
    choices={
        "accurate": (1.0, "Factually correct information"),
        "partially_accurate": (0.5, "Some correct, some incorrect information"),
        "inaccurate": (0.0, "Factually incorrect information")
    }
)

result = accuracy_evaluator.evaluate({"claim": "Paris is the capital of France"})
print(f"Accuracy: {result[0].label} (Score: {result[0].score})")

bind_evaluator #

bind_evaluator(evaluator, input_mapping)#

Helper to bind an evaluator with a fixed input mapping.

This function allows you to create a version of an evaluator that automatically maps input data fields to the evaluator’s expected field names. This is useful when your data schema doesn’t match the evaluator’s expected inputs.

Parameters:

evaluator – The evaluator instance to bind.
input_mapping – A dictionary mapping evaluator field names to either: - String keys for direct field mapping - Callable functions for computed field mapping

Returns:

The same evaluator instance with the input mapping bound.

Examples

Basic field mapping:

from phoenix.evals import create_evaluator, bind_evaluator

@create_evaluator(name="text_length")
def text_length(content: str) -> int:
    return len(content)

# Map 'message' field to 'content' parameter
mapping = {"content": "message"}
bound_evaluator = bind_evaluator(evaluator=text_length, input_mapping=mapping)

# Now we can use 'message' instead of 'content'
result = bound_evaluator.evaluate({"message": "Hello world"})
print(result[0].score)  # 11

Using lambda functions for computed mappings:

@create_evaluator(name="precision")
def precision(retrieved_docs: list, relevant_docs: list) -> float:
    relevant_set = set(relevant_docs)
    hits = sum(1 for doc in retrieved_docs if doc in relevant_set)
    return hits / len(retrieved_docs) if retrieved_docs else 0.0

# Convert single document to list format
mapping = {
    "retrieved_docs": "retrieved_documents",
    "relevant_docs": lambda x: [x["expected_document"]]
}
bound_evaluator = bind_evaluator(evaluator=precision, input_mapping=mapping)

data = {
    "retrieved_documents": [1, 2, 3],
    "expected_document": 2
}
result = bound_evaluator.evaluate(data)

Complex data transformation:

@create_evaluator(name="response_quality")
def response_quality(question: str, answer: str, context: str) -> dict:
    # Simplified quality check
    has_context = context.lower() in answer.lower()
    return {
        "score": 1.0 if has_context else 0.0,
        "label": "good" if has_context else "poor",
        "explanation": ("Answer uses context" if has_context
                       else "Answer ignores context")
    }

# Map nested data structure
mapping = {
    "question": "query",
    "answer": "response.text",
    "context": lambda x: " ".join(x["documents"])
}
bound_evaluator = bind_evaluator(evaluator=response_quality, input_mapping=mapping)

data = {
    "query": "What is the capital?",
    "response": {"text": "Paris is the capital of France"},
    "documents": ["France info", "Paris is the capital"]
}
result = bound_evaluator.evaluate(data)

evaluate_dataframe #

evaluate_dataframe(dataframe, evaluators, tqdm_bar_format=None, hide_tqdm_bar=False, exit_on_error=None, max_retries=None)#

Evaluate a dataframe with a list of evaluators and return an augmented dataframe.

This function uses a synchronous executor; for async evaluation, use async_evaluate_dataframe.

Parameters:

dataframe – The input dataframe to evaluate. Each row will be converted to a dict and passed to each evaluator.
evaluators – List of evaluators to apply to each row. Input mapping should be already bound via bind_evaluator or column names should match evaluator input fields.
tqdm_bar_format – Optional format string for the progress bar. If None and hide_tqdm_bar is False, the default progress bar formatter is used.
hide_tqdm_bar – Optional flag to control whether to hide the progress bar. If None, the progress bar is shown. Defaults to False.
exit_on_error – Optional flag to control whether execution should stop on the first error. If None, uses SyncExecutor’s default (True).
max_retries – Optional number of times to retry on exceptions. If None, uses SyncExecutor’s default (10).

Returns:

A copy of the input dataframe with additional columns for scores and exceptions. For each evaluator, columns are added for: - “{evaluator.name}_execution_details”: Details about any exceptions encountered, execution

time, and status.

”{score.name}_score”: JSON-serialized Score objects for each score returned

Examples

Basic dataframe evaluation:

import pandas as pd
from phoenix.evals import create_evaluator, evaluate_dataframe

@create_evaluator(name="word_count")
def word_count(text: str) -> int:
    return len(text.split())

@create_evaluator(name="has_question")
def has_question(text: str) -> bool:
    return "?" in text

df = pd.DataFrame({
    "text": [
        "Hello world",
        "How are you today?",
        "This is a longer sentence with multiple words"
    ]
})

evaluators = [word_count, has_question]
results_df = evaluate_dataframe(dataframe=df, evaluators=evaluators, hide_tqdm_bar=True)

# Results include original columns plus score columns
print(results_df.columns)
# ['text', 'word_count_execution_details', 'has_question_execution_details',
#  'word_count_score', 'has_question_score']

Using with input mapping:

from phoenix.evals import bind_evaluator

@create_evaluator(name="response_length")
def response_length(response: str) -> int:
    return len(response)

# Data has 'answer' column but evaluator expects 'response'
mapping = {"response": "answer"}
bound_evaluator = bind_evaluator(evaluator=response_length, input_mapping=mapping)

df = pd.DataFrame({
    "question": ["What is AI?", "How does ML work?"],
    "answer": ["AI is artificial intelligence",
              "ML uses algorithms to learn patterns"]
})

results_df = evaluate_dataframe(dataframe=df, evaluators=[bound_evaluator])

With progress bar and error handling:

results_df = evaluate_dataframe(
    dataframe=df,
    evaluators=evaluators,
    tqdm_bar_format="{l_bar}{bar}| {n_fmt}/{total_fmt} [{elapsed}<{remaining}]",
    exit_on_error=False,  # Continue on errors
    max_retries=3
)

# Check for evaluation errors
import json
for idx, row in results_df.iterrows():
    details = json.loads(row['word_count_execution_details'])
    if details['status'] != 'success':
        print(f"Row {idx} failed: {details['exceptions']}")

Notes

Score name collisions: If multiple evaluators return scores with the same name, they will write to the same column (e.g., ‘same_name_score’). This can lead to data loss as later scores overwrite earlier ones.
Similarly, evaluator names should be unique to ensure execution_details columns don’t collide.
Failed evaluations: If an evaluation fails, the failure details will be recorded in the execution_details column and the score will be None.

async_evaluate_dataframe #

async_evaluate_dataframe(dataframe, evaluators, concurrency=None, tqdm_bar_format=None, hide_tqdm_bar=False, exit_on_error=None, max_retries=None)#

Evaluate a dataframe with a list of evaluators and return an augmented dataframe.

This function uses an asynchronous executor; for sync evaluation, use evaluate_dataframe.

Parameters:

dataframe – The input dataframe to evaluate. Each row will be converted to a dict and passed to each evaluator.
evaluators – List of evaluators to apply to each row. Input mapping should be already bound via bind_evaluator or column names should match evaluator input fields.
concurrency – Optional number of concurrent consumers. If None, uses AsyncExecutor’s default (3).
tqdm_bar_format – Optional format string for the progress bar. If None, use the default formatter.
hide_tqdm_bar – Optional flag to control whether to hide the progress bar. If None, the progress bar is shown. Defaults to False.
exit_on_error – Optional flag to control whether execution should stop on the first error. If None, uses AsyncExecutor’s default (True).
max_retries – Optional number of times to retry on exceptions or timeouts. A task that times out (default 60s per task) counts against this budget and is marked failed once it is exhausted. If None, uses AsyncExecutor’s default (10).

Returns:

A copy of the input dataframe with additional columns for scores and exceptions. For each evaluator, columns are added for: - “{evaluator.name}_execution_details”: Details about any exceptions encountered, execution

time, and status.

”{score.name}_score”: JSON-serialized Score objects for each score returned

Examples

Basic async evaluation:

import asyncio
import pandas as pd
from phoenix.evals import create_evaluator, async_evaluate_dataframe

@create_evaluator(name="text_analysis")
def text_analysis(text: str) -> dict:
    return {
        "score": len(text.split()),
        "label": "long" if len(text) > 50 else "short"
    }

df = pd.DataFrame({
    "text": [
        "Short text",
        "This is a much longer text that contains many more words and characters",
        "Medium length text here"
    ]
})

async def main():
    results_df = await async_evaluate_dataframe(
        dataframe=df,
        evaluators=[text_analysis],
        concurrency=5  # Process up to 5 rows concurrently
        hide_tqdm_bar=True,
    )
    return results_df

results_df = asyncio.run(main())
print(results_df.columns)

With LLM evaluators:

from phoenix.evals import create_classifier, LLM

llm = LLM(provider="openai", model="gpt-4")

sentiment_evaluator = create_classifier(
    name="sentiment",
    prompt_template="Classify sentiment: {text}",
    llm=llm,
    choices=["positive", "negative", "neutral"]
)

df = pd.DataFrame({
    "text": [
        "I love this product!",
        "This is terrible quality",
        "It's okay, nothing special"
    ]
})

async def evaluate_sentiment():
    results_df = await async_evaluate_dataframe(
        dataframe=df,
        evaluators=[sentiment_evaluator],
        concurrency=2,  # Limit concurrent LLM calls
        tqdm_bar_format="{l_bar}{bar}| {n_fmt}/{total_fmt}"
    )
    return results_df

results_df = asyncio.run(evaluate_sentiment())

Error handling and retries:

async def robust_evaluation():
    results_df = await async_evaluate_dataframe(
        dataframe=df,
        evaluators=evaluators,
        concurrency=3,
        exit_on_error=False,  # Continue despite errors
        max_retries=5,        # Retry failed evaluations
        tqdm_bar_format="{l_bar}{bar}| {n_fmt}/{total_fmt} [{elapsed}<{remaining}]"
    )

    # Check for failures
    import json
    failed_rows = []
    for idx, row in results_df.iterrows():
        details = json.loads(row['sentiment_execution_details'])
        if details['status'] != 'success':
            failed_rows.append(idx)

    print(f"Failed evaluations: {len(failed_rows)} out of {len(results_df)}")
    return results_df

results_df = asyncio.run(robust_evaluation())

Notes

Score name collisions: If multiple evaluators return scores with the same name, they will write to the same column (e.g., ‘same_name_score’). This can lead to data loss as later scores overwrite earlier ones.
Similarly, evaluator names should be unique to ensure execution_details columns don’t collide.
Failed evaluations: If an evaluation fails, the failure details will be recorded in the execution_details column and the score will be None.

Score #

class Score(*, name=None, score=None, label=None, explanation=None, metadata=None, direction='maximize', kind=None)#

Bases: object

Represents the result of an evaluation.

A Score contains the evaluation result along with metadata about the evaluation. It can represent numeric scores, categorical labels, explanations, or combinations thereof.

Examples

Creating different types of scores:

from phoenix.evals.evaluators import Score

# Numeric score only
numeric_score = Score(
    name="accuracy",
    score=0.85,
    kind="llm",
    direction="maximize"
)

# Label only (categorical)
label_score = Score(
    name="sentiment",
    label="positive",
    kind="llm",
    direction="maximize"
)

# Score with explanation
detailed_score = Score(
    name="relevance",
    score=0.9,
    label="highly_relevant",
    explanation="The answer directly addresses all aspects of the question",
    metadata={"model": "gpt-4", "confidence": 0.95},
    kind="llm",
    direction="maximize"
)

# Boolean evaluation
boolean_score = Score(
    name="has_citation",
    score=1.0,
    label="true",
    explanation="Found 3 citations in the text",
    kind="code",
    direction="maximize"
)

pretty_print(indent=2)#

Pretty print the Score as formatted JSON.

Parameters:: indent – Number of spaces for indentation. Defaults to 2.

property source#: The source of this score (deprecated).

to_dict()#

Convert the Score to a dictionary, excluding None values.

Returns:: A dictionary representation of the Score with None values excluded.

Built-in Metrics #

class ConcisenessEvaluator(llm, **kwargs)#

Bases: ClassificationEvaluator

An evaluator for assessing whether model outputs are concise and free of unnecessary content.

Parameters:

llm (LLM) – The LLM instance to use for the evaluation.
**kwargs – Additional invocation parameters forwarded to the LLM client (e.g., temperature=0.0, max_tokens=256).

Notes

Evaluates whether the output to an input is concise or verbose.
Returns one Score with label (concise or verbose), score (1.0 if concise, 0.0 if verbose), and an explanation from the LLM judge.
Requires an LLM that supports tool calling or structured output.

Examples:

from phoenix.evals.metrics.conciseness import ConcisenessEvaluator
from phoenix.evals import LLM
llm = LLM(provider="openai", model="gpt-4o-mini")

# Default usage
conciseness_eval = ConcisenessEvaluator(llm=llm)

# With custom invocation parameters
conciseness_eval = ConcisenessEvaluator(llm=llm, temperature=0.0)

eval_input = {
    "input": "What is the capital of France?",
    "output": "Paris.",
    }
scores = conciseness_eval.evaluate(eval_input)
print(scores)
[Score(name='conciseness', score=1.0, label='concise',
    explanation='The response directly answers the question with no extra words.',
    metadata={'model': 'gpt-4o-mini'},
    kind="llm", direction="maximize")]

CHOICES = {'concise': 1.0, 'verbose': 0.0}#

class ConcisenessInputSchema(*, input, output)#

Bases: BaseModel

input#

model_config = {}#: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

output#

DIRECTION = 'maximize'#

NAME = 'conciseness'#

PROMPT = <phoenix.evals.llm.prompts.PromptTemplate object>#

class CorrectnessEvaluator(llm, **kwargs)#

Bases: ClassificationEvaluator

An evaluator for assessing factual accuracy and completeness of model outputs.

Parameters:

llm (LLM) – The LLM instance to use for the evaluation.
**kwargs – Additional invocation parameters forwarded to the LLM client (e.g., temperature=0.0, max_tokens=256).

Notes

Evaluates whether the output to an input is correct or incorrect.
Returns one Score with label (correct or incorrect), score (1.0 if correct, 0.0 if incorrect), and an explanation from the LLM judge.
Requires an LLM that supports tool calling or structured output.

Examples:

from phoenix.evals.metrics.correctness import CorrectnessEvaluator
from phoenix.evals import LLM
llm = LLM(provider="openai", model="gpt-4o-mini")

# Default usage
correctness_eval = CorrectnessEvaluator(llm=llm)

# With custom invocation parameters
correctness_eval = CorrectnessEvaluator(llm=llm, temperature=0.0)

eval_input = {
    "input": "What is the capital of France?",
    "output": "Paris is the capital of France.",
    }
scores = correctness_eval.evaluate(eval_input)
print(scores)
[Score(name='correctness', score=1.0, label='correct',
    explanation='The response accurately states that Paris is the capital of France.',
    metadata={'model': 'gpt-4o-mini'},
    kind="llm", direction="maximize")]

CHOICES = {'correct': 1.0, 'incorrect': 0.0}#

class CorrectnessInputSchema(*, input, output)#

Bases: BaseModel

input#

model_config = {}#: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

output#

DIRECTION = 'maximize'#

NAME = 'correctness'#

PROMPT = <phoenix.evals.llm.prompts.PromptTemplate object>#

class DocumentRelevanceEvaluator(llm, **kwargs)#

Bases: ClassificationEvaluator

A specialized evaluator for determining document relevance to a given question.

Parameters:

llm (LLM) – The LLM instance to use for the evaluation.
**kwargs – Additional invocation parameters forwarded to the LLM client (e.g., temperature=0.0, max_tokens=256).

Notes

Evaluates whether a document contains information relevant to answering a specific question.
Returns one Score with label (relevant or unrelated), score (1.0 if relevant, 0.0 if unrelated), and an explanation from the LLM judge.
Requires an LLM that supports tool calling or structured output.

Examples:

from phoenix.evals.metrics.document_relevance import DocumentRelevanceEvaluator
from phoenix.evals import LLM
llm = LLM(provider="openai", model="gpt-4o-mini")

# Default usage
relevance_eval = DocumentRelevanceEvaluator(llm=llm)

# With custom invocation parameters
relevance_eval = DocumentRelevanceEvaluator(llm=llm, temperature=0.0)

eval_input = {
    "input": "What is the capital of France?",
    "document_text": "Paris is the capital and largest city of France"
    }
scores = relevance_eval.evaluate(eval_input)
print(scores)

CHOICES = {'relevant': 1.0, 'unrelated': 0.0}#

DIRECTION = 'maximize'#

class DocumentRelevanceInputSchema(*, input, document_text)#

Bases: BaseModel

document_text#

input#

model_config = {}#: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

NAME = 'document_relevance'#

PROMPT = <phoenix.evals.llm.prompts.PromptTemplate object>#

class FaithfulnessEvaluator(llm, **kwargs)#

Bases: ClassificationEvaluator

A specialized evaluator for detecting faithfulness in grounded LLM responses.

Parameters:

llm (LLM) – The LLM instance to use for the evaluation.
**kwargs – Additional invocation parameters forwarded to the LLM client (e.g., temperature=0.0, max_tokens=256).

Notes

Evaluates whether the output to an input is faithful or unfaithful based on the context.
Returns one Score with label (faithful or unfaithful), score (1.0 if faithful, 0.0 if unfaithful), and an explanation from the LLM judge.
Requires an LLM that supports tool calling or structured output.

Examples:

from phoenix.evals.metrics.faithfulness import FaithfulnessEvaluator
from phoenix.evals import LLM
llm = LLM(provider="openai", model="gpt-4o-mini")

# Default usage
faithfulness_eval = FaithfulnessEvaluator(llm=llm)

# With custom invocation parameters
faithfulness_eval = FaithfulnessEvaluator(llm=llm, temperature=0.0)

eval_input = {
    "input": "What is the capital of France?",
    "output": "Paris is the capital of France.",
    "context": "Paris is the capital and largest city of France."
    }
scores = faithfulness_eval.evaluate(eval_input)
print(scores)
[Score(name='faithfulness', score=1.0, label='faithful',
    explanation='Information is supported by context', metadata={'model': 'gpt-4o-mini'},
    kind="llm", direction="maximize")]

CHOICES = {'faithful': 1.0, 'unfaithful': 0.0}#

DIRECTION = 'maximize'#

class FaithfulnessInputSchema(*, input, output, context)#

Bases: BaseModel

context#

input#

model_config = {}#: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

output#

NAME = 'faithfulness'#

PROMPT = <phoenix.evals.llm.prompts.PromptTemplate object>#

class HallucinationEvaluator(llm, **kwargs)#

Bases: ClassificationEvaluator

A specialized evaluator for detecting hallucinations in grounded LLM responses.

Deprecated since version HallucinationEvaluator: is deprecated. Please use FaithfulnessEvaluator instead. The new evaluator uses ‘faithful’/’unfaithful’ labels and maximizes score (1.0=faithful).

Parameters:

llm (LLM) – The LLM instance to use for the evaluation.
**kwargs – Additional invocation parameters forwarded to the LLM client (e.g., temperature=0.0, max_tokens=256).

Notes

Evaluates whether the output to an input is factual or hallucinated based on the context.
Returns one Score with label (factual or hallucinated), score (1.0 if hallucinated, 0.0 if factual), and an explanation from the LLM judge.
Requires an LLM that supports tool calling or structured output.

Examples:

from phoenix.evals.metrics.hallucination import HallucinationEvaluator
from phoenix.evals import LLM
llm = LLM(provider="openai", model="gpt-4o-mini")

# Default usage
hallucination_eval = HallucinationEvaluator(llm=llm)

# With custom invocation parameters
hallucination_eval = HallucinationEvaluator(llm=llm, temperature=0.0)

eval_input = {
    "input": "What is the capital of France?",
    "output": "Paris is the capital of France.",
    "context": "Paris is the capital and largest city of France."
    }
scores = hallucination_eval.evaluate(eval_input)
print(scores)
[Score(name='hallucination', score=0.0, label='factual',
    explanation='Information is supported by context', metadata={'model': 'gpt-4o-mini'},
    kind="llm", direction="minimize")]

CHOICES = {'factual': 0.0, 'hallucinated': 1.0}#

DIRECTION = 'minimize'#

class HallucinationInputSchema(*, input, output, context)#

Bases: BaseModel

context#

input#

model_config = {}#: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

output#

NAME = 'hallucination'#

PROMPT = <phoenix.evals.llm.prompts.PromptTemplate object>#

class MatchesRegex(pattern, name=None, include_explanation=True)#

Bases: Evaluator

Evaluates whether text output matches a specified regular expression pattern.

This code evaluator checks if the output contains one or more substrings that match a given regex pattern. It returns a binary score (1.0 for match, 0.0 for no match) along with an explanation of which substrings matched or that no match was found.

Parameters:

pattern – The regular expression pattern to match against. Can be provided as a string or a compiled Pattern object.
name – Optional custom name for the evaluator. If not provided, defaults to “matches_regex”.
include_explanation – Whether to include an explanation in the Score object. Defaults to True.

Examples

Basic usage with URL detection:

import re
from phoenix.evals.metrics.matches_regex import MatchesRegex

# Compiled regex pattern
pattern = re.compile(r"https?://[^\s]+")
contains_link = MatchesRegex(pattern=pattern)

eval_input = {"output": "Check out https://github.com/Arize-ai/phoenix!"}

scores = contains_link.evaluate(eval_input)
print(scores)
# [Score(name='matches_regex',
#        score=1.0,
#        label=None,
#        explanation='There are 1 matches for the regex: https?://[^\s]+',
#        metadata={},
#        kind='code',
#        direction='maximize')]

class InputSchema(*, output)#

Bases: BaseModel

model_config = {}#: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

output#

class PrecisionRecallFScore(*, beta=1.0, average='macro', zero_division=0.0, positive_label=None)#

Bases: Evaluator

Code evaluator that computes precision, recall, and F-beta score given lists of expected and output labels.

Parameters:

beta (-) – Weight of recall relative to precision. Must be > 0. Defaults to 1.0 (F1).
average (-) – Aggregation strategy across classes. One of {‘macro’,’micro’,’weighted’}. Defaults to ‘macro’. Suffixes are only appended to metric names when a non-default average is used.
positive_label (-) – When set, compute binary precision/recall/F exclusively for this label (one-vs-rest). If None, average is at its default ‘macro’, and labels are numeric with unique set {0,1}, the positive label defaults to 1. Otherwise, multi-class averaging is used.
zero_division (-) – Value to use when a metric is undefined (e.g., 0/0). Defaults to 0.0.
eval_input (Mapping[str, Any]) – Two lists of hashable labels: - expected (Sequence[Hashable]): Expected/true sequence of labels - output (Sequence[Hashable]): Output/predicted sequence of labels

Returns:

A list of Score objects.

Return type:

List[Score]

Raises:

ValueError – If input validation fails.

Notes

Supports labels as strings or integers (must be hashable)
Supports both binary and multi-class classification via averaging strategies
Score Naming:
- Defaults (beta=1.0, average=”macro”): names are precision, recall, and f1.
- Non-default average: e.g., precision_micro, recall_weighted, f0_5_micro.

Examples

Multi-class (macro):

evaluator = PrecisionRecallFScore(beta=1.0, average="macro")
eval_input = {"expected": ["cat", "dog", "cat", "bird"],
              "output": ["cat", "cat", "cat", "bird"]}
scores = evaluator(eval_input)
[s.name for s in scores]
['precision', 'recall', 'f1']

Binary with explicit positive label:

evaluator = PrecisionRecallFScore(beta=0.5, positive_label="spam")
eval_input = {"expected": ["spam", "ham", "spam"],
              "output": ["spam", "spam", "ham"]}
scores = evaluator(eval_input)
[s.name for s in scores]
['precision', 'recall', 'f0_5']

class InputSchema(*, expected, output)#

Bases: BaseModel

expected#

model_config = {}#: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

output#

class RefusalEvaluator(llm, **kwargs)#

Bases: ClassificationEvaluator

An evaluator for detecting when an LLM refuses or declines to answer a query.

Parameters:

llm (LLM) – The LLM instance to use for the evaluation.
**kwargs – Additional invocation parameters forwarded to the LLM client (e.g., temperature=0.0, max_tokens=256).

Notes

Detects refusals, deflections, scope disclaimers, and non-answers.
Returns one Score with label (refused or answered), score (1.0 if refused, 0.0 if answered), and an explanation from the LLM judge.
This metric is use-case agnostic: it only detects whether a refusal occurred, not whether the refusal was appropriate.
Requires an LLM that supports tool calling or structured output.

Examples:

from phoenix.evals.metrics.refusal import RefusalEvaluator
from phoenix.evals import LLM
llm = LLM(provider="openai", model="gpt-4o-mini")

# Default usage
refusal_eval = RefusalEvaluator(llm=llm)

# With custom invocation parameters
refusal_eval = RefusalEvaluator(llm=llm, temperature=0.0)

eval_input = {
    "input": "What is the capital of France?",
    "output": "I'm sorry, I can only help with technical questions.",
    }
scores = refusal_eval.evaluate(eval_input)
print(scores)
[Score(name='refusal', score=1.0, label='refused',
    explanation='The response refuses to answer by claiming scope limitations.',
    metadata={'model': 'gpt-4o-mini'},
    kind="llm", direction="neutral")]

CHOICES = {'answered': 0.0, 'refused': 1.0}#

DIRECTION = 'neutral'#

NAME = 'refusal'#

PROMPT = <phoenix.evals.llm.prompts.PromptTemplate object>#

class RefusalInputSchema(*, input, output)#

Bases: BaseModel

input#

model_config = {}#: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

output#

class ToolInvocationEvaluator(llm, **kwargs)#

Bases: ClassificationEvaluator

Determines if a tool was invoked correctly with proper arguments, formatting, and safe content.

Parameters:

llm (LLM) – The LLM instance to use for the evaluation.
**kwargs – Additional invocation parameters forwarded to the LLM client (e.g., temperature=0.0, max_tokens=256).

Notes

Evaluates whether an AI agent’s tool invocation was correct or incorrect based on the conversation context, available tool schemas, and the agent’s tool invocation(s).
This metric evaluates the correctness of the tool invocation (arguments, formatting, safety), not the correctness of the tool selection itself.
Returns one Score with label (correct or incorrect), score (1.0 if correct, 0.0 if incorrect), and an explanation from the LLM judge.
Requires an LLM that supports tool calling or structured output.

Criteria for Correct Invocation:

JSON is properly structured (if applicable).
All required fields/parameters are present.
No hallucinated or nonexistent fields (all fields exist in the tool schema).
Argument values match the user query and schema expectations.
No unsafe content (e.g., PII) in arguments.

Criteria for Incorrect Invocation:

Hallucinated or nonexistent fields not in the schema.
Missing required fields/parameters.
Improperly formatted or malformed JSON.
Incorrect, hallucinated, or mismatched argument values.
Unsafe content (e.g., PII, sensitive data) in arguments.

Examples:

from phoenix.evals.metrics.tool_invocation import ToolInvocationEvaluator
from phoenix.evals import LLM

llm = LLM(provider="openai", model="gpt-4o-mini")

# Default usage
tool_invocation_eval = ToolInvocationEvaluator(llm=llm)

# With custom invocation parameters
tool_invocation_eval = ToolInvocationEvaluator(llm=llm, temperature=0.0)

# Example with JSON schema format for available tools
eval_input = {
    "input": "User: Book a flight from NYC to LA for tomorrow",
    "available_tools": '''
    {
        "name": "book_flight",
        "description": "Book a flight between two cities",
        "parameters": {
            "type": "object",
            "properties": {
                "origin": {"type": "string", "description": "Departure city code"},
                "destination": {"type": "string", "description": "Arrival city code"},
                "date": {"type": "string", "description": "Flight date in YYYY-MM-DD"}
            },
            "required": ["origin", "destination", "date"]
        }
    }
    ''',
    "tool_selection": '''
    book_flight(origin="NYC", destination="LA", date="2024-01-15")
    '''
}
scores = tool_invocation_eval.evaluate(eval_input)
print(scores)

# Example with human-readable format for available tools
eval_input_readable = {
    "input": "User: What's the weather in San Francisco?",
    "available_tools": '''
    WeatherTool:
      Description: Get the current weather for a location
      Parameters:
        - location (required): The city name or coordinates
        - units (optional): Temperature units (celsius or fahrenheit)
    ''',
    "tool_selection": "WeatherTool(location='San Francisco', units='fahrenheit')"
}
scores = tool_invocation_eval.evaluate(eval_input_readable)
print(scores)

CHOICES = {'correct': 1.0, 'incorrect': 0.0}#

DIRECTION = 'maximize'#

NAME = 'tool_invocation'#

PROMPT = <phoenix.evals.llm.prompts.PromptTemplate object>#

class ToolInvocationInputSchema(*, input, available_tools, tool_selection)#

Bases: BaseModel

available_tools#

input#

model_config = {}#: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

tool_selection#

class ToolResponseHandlingEvaluator(llm, **kwargs)#

Bases: ClassificationEvaluator

Determines if an AI agent properly handled a tool’s response, including error handling, data extraction, transformation, and safe information disclosure.

Parameters:

llm (LLM) – The LLM instance to use for the evaluation.
**kwargs – Additional invocation parameters forwarded to the LLM client (e.g., temperature=0.0, max_tokens=256).

Notes

Evaluates whether an AI agent correctly processed the tool result to produce an appropriate output.
This metric evaluates what happens AFTER the tool returns, NOT whether the right tool was selected (tool_selection) or invoked correctly (tool_invocation).
Returns one Score with label (correct or incorrect), score (1.0 if correct, 0.0 if incorrect), and an explanation from the LLM judge.
Requires an LLM that supports tool calling or structured output.

Examples:

from phoenix.evals.metrics.tool_response_handling import ToolResponseHandlingEvaluator
from phoenix.evals import LLM

llm = LLM(provider="openai", model="gpt-4o-mini")

# Default usage
tool_response_eval = ToolResponseHandlingEvaluator(llm=llm)

# With custom invocation parameters
tool_response_eval = ToolResponseHandlingEvaluator(llm=llm, temperature=0.0)

# Example: Correct extraction from tool result
eval_input = {
    "input": "What's the weather in Seattle?",
    "tool_call": 'get_weather(location="Seattle")',
    "tool_result": '{"temperature": 58, "conditions": "cloudy"}',
    "output": "Seattle is currently 58°F and cloudy."
}
scores = tool_response_eval.evaluate(eval_input)
print(scores)

# Example: Hallucinated data (incorrect)
eval_input_hallucinated = {
    "input": "What restaurants are nearby?",
    "tool_call": 'search_restaurants(location="downtown")',
    "tool_result": '{"results": [{"name": "Cafe Luna", "rating": 4.2}]}',
    "output": "I found Cafe Luna and Mario's Italian nearby."
}
scores = tool_response_eval.evaluate(eval_input_hallucinated)
print(scores)  # Should be incorrect - Mario's Italian was hallucinated

# Example: Error handling with retry
eval_input_retry = {
    "input": "Find my recent orders",
    "tool_call": "get_orders(user_id='123')",
    "tool_result": '{"error": "rate_limit_exceeded", "retry_after": 30}',
    "output": "[Retried] Your order (ORD-001) has shipped."
}
scores = tool_response_eval.evaluate(eval_input_retry)
print(scores)

CHOICES = {'correct': 1.0, 'incorrect': 0.0}#

DIRECTION = 'maximize'#

NAME = 'tool_response_handling'#

PROMPT = <phoenix.evals.llm.prompts.PromptTemplate object>#

class ToolResponseHandlingInputSchema(*, input, tool_call, tool_result, output)#

Bases: BaseModel

input#

model_config = {}#: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

output#

tool_call#

tool_result#

class ToolSelectionEvaluator(llm, **kwargs)#

Bases: ClassificationEvaluator

A specialized evaluator for determining if the correct tool was selected for a given context.

Parameters:

llm (LLM) – The LLM instance to use for the evaluation.
**kwargs – Additional invocation parameters forwarded to the LLM client (e.g., temperature=0.0, max_tokens=256).

Notes

Evaluates whether an AI agent’s tool selection was correct or incorrect based on the conversation context, available tools, and the agent’s tool invocations.
The agent’s tool selection can be a single tool or a list of tools.
This metric evaluates the correctness of the tool selection, not the correctness of the tool invocations or the tool outputs.
Returns one Score with label (correct or incorrect), score (1.0 if correct, 0.0 if incorrect), and an explanation from the LLM judge.
Requires an LLM that supports tool calling or structured output.

Examples:

from phoenix.evals.metrics.tool_selection import ToolSelectionEvaluator
from phoenix.evals import LLM
llm = LLM(provider="openai", model="gpt-4o-mini")

# Default usage
tool_selection_eval = ToolSelectionEvaluator(llm=llm)

# With custom invocation parameters
tool_selection_eval = ToolSelectionEvaluator(llm=llm, temperature=0.0)

eval_input = {
    "input": "User: What is the weather in San Francisco?",
    "available_tools": (
        "WeatherTool: Get the current weather for a location.\n"
        "NewsTool: Stay connected to global events with our up-to-date news.\n"
        "MusicTool: Create playlists, search for music, and check music trends."
    ),
    "tool_selection": "WeatherTool(location='San Francisco')" # input args optional
}
scores = tool_selection_eval.evaluate(eval_input)
print(scores)

CHOICES = {'correct': 1.0, 'incorrect': 0.0}#

DIRECTION = 'maximize'#

NAME = 'tool_selection'#

PROMPT = <phoenix.evals.llm.prompts.PromptTemplate object>#

class ToolSelectionInputSchema(*, input, available_tools, tool_selection)#

Bases: BaseModel

available_tools#

input#

model_config = {}#: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

tool_selection#

class ToxicityEvaluator(llm, **kwargs)#

Bases: ClassificationEvaluator

An evaluator for detecting toxic text — hateful, demeaning, abusive, or threatening.

Parameters:

llm (LLM) – The LLM instance to use for the evaluation.
**kwargs – Additional invocation parameters forwarded to the LLM client (e.g., temperature=0.0, max_tokens=256).

Notes

Classifies a single piece of text (a model output or a user input) as toxic or non-toxic.
Returns one Score with label (toxic or non-toxic), score (1.0 if toxic, 0.0 if non-toxic), and an explanation from the LLM judge.
Requires an LLM that supports tool calling or structured output.

Examples:

from phoenix.evals.metrics.toxicity import ToxicityEvaluator
from phoenix.evals import LLM
llm = LLM(provider="openai", model="gpt-4o-mini")

# Default usage
toxicity_eval = ToxicityEvaluator(llm=llm)

# With custom invocation parameters
toxicity_eval = ToxicityEvaluator(llm=llm, temperature=0.0)

eval_input = {"text": "You are a worthless idiot and everyone despises you."}
scores = toxicity_eval.evaluate(eval_input)
print(scores)
[Score(name='toxicity', score=1.0, label='toxic',
    explanation='The text directs abusive, demeaning language at a person.',
    metadata={'model': 'gpt-4o-mini'},
    kind="llm", direction="minimize")]

CHOICES = {'non-toxic': 0.0, 'toxic': 1.0}#

DIRECTION = 'minimize'#

NAME = 'toxicity'#

PROMPT = <phoenix.evals.llm.prompts.PromptTemplate object>#

class ToxicityInputSchema(*, text)#

Bases: BaseModel

model_config = {}#: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

text#

class UserFrictionEvaluator(llm, **kwargs)#

Bases: ClassificationEvaluator

An evaluator for detecting when a user expresses friction with an assistant’s preceding behavior.

Parameters:

llm (LLM) – The LLM instance to use for the evaluation.
**kwargs – Additional invocation parameters forwarded to the LLM client (e.g., temperature=0.0, max_tokens=256).

Notes

Detects expressed corrections, retries, frustration, and challenges directed at the assistant’s preceding behavior.
The conversation must contain the history before the target user message. The target message is supplied separately as user_message so the judge cannot confuse it with earlier turns.
Returns one Score with label (friction or no_friction), score (1.0 if friction, 0.0 if no_friction), and an explanation from the LLM judge.
no_friction does not prove the user was satisfied; users often abandon conversations without saying why.
Requires an LLM that supports tool calling or structured output.

Examples:

from phoenix.evals.metrics.user_friction import UserFrictionEvaluator
from phoenix.evals import LLM
llm = LLM(provider="openai", model="gpt-4o-mini")

# Default usage
user_friction_eval = UserFrictionEvaluator(llm=llm)

# With custom invocation parameters
user_friction_eval = UserFrictionEvaluator(llm=llm, temperature=0.0)

eval_input = {
    "conversation": (
        "User: Show orders from this week.\n"
        "Assistant: Here are last month's orders."
    ),
    "user_message": "No, I asked for this week.",
    }
scores = user_friction_eval.evaluate(eval_input)
print(scores)
[Score(name='user_friction', score=1.0, label='friction',
    explanation='The user corrects the assistant for answering the wrong time range.',
    metadata={'model': 'gpt-4o-mini'},
    kind="llm", direction="minimize")]

CHOICES = {'friction': 1.0, 'no_friction': 0.0}#

DIRECTION = 'minimize'#

NAME = 'user_friction'#

PROMPT = <phoenix.evals.llm.prompts.PromptTemplate object>#

class UserFrictionInputSchema(*, conversation, user_message)#

Bases: BaseModel

conversation#

model_config = {}#: Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

user_message#

Utilities #

default_tqdm_progress_bar_formatter(title)#

Returns a progress bar formatter for use with tqdm.

Parameters:: title (str) – The title of the progress bar, displayed as a prefix.
Returns:: A formatter to be passed to the bar_format argument of tqdm.
Return type:: str

extract_with_jsonpath(data, path, match_all=False)#

Extract a value from a nested JSON structure using jsonpath-ng.

Parameters:

data – The input dictionary to be extracted from.
path – The jsonpath to extract from the data.
match_all – If True, return a list of all matches. By default, return only the first match.

Returns:

The extracted value (can be None).

Raises:

JsonPathParserError – If the path is not parseable (invalid syntax).
ValueError – If the path is invalid or not found (missing key, index out of bounds, etc).

remap_eval_input(eval_input, required_fields, input_mapping=None)#

Remap eval_input keys based on required_fields and an optional input_mapping.

Parameters:

eval_input – The input dictionary to be remapped.
required_fields – The required field names as a set of strings.
input_mapping – Optional mapping from evaluator-required field -> eval_input key.

Returns:

A dictionary with keys as required_fields and values from eval_input.

Raises:

ValueError – If a required field is missing in eval_input or has a null/empty value.

to_annotation_dataframe(dataframe, score_names=None)#

Format scores as annotations for logging to Phoenix.

This function takes the output of evaluate_dataframe, and a list of score names, formats it for Phoenix logging. If no score names are provided, the function will extract all scores from the dataframe (_score columns). Score, label, explanation, and metadata are extracted from the score column and exploded into separate columns. Annotation name and kind are also added as columns.

Parameters:

dataframe (pd.DataFrame) – DataFrame returned by (async_)evaluate_dataframe
score_names (List[str]) – Names of the score columns to log (e.g., [“precision”,
None ("hallucination"]). If)
used. (all columns ending with _score will be)

Returns:

DataFrame with the score column, annotation name, and annotator kind columns for the specified score names.

Return type:

pd.DataFrame

Examples:

from phoenix.client import Client
from phoenix.evals import evaluate_dataframe
from phoenix.evals.utils import to_annotation_dataframe

client = Client()
results = evaluate_dataframe(df, evaluators)

# Log only hallucination annotations
hallucination_annotations = to_annotation_dataframe(results, ["hallucination"])
client.spans.log_span_annotations_dataframe(dataframe=hallucination_annotations)

# Log all scores as annotations
all_annotations = to_annotation_dataframe(results)
client.spans.log_span_annotations_dataframe(dataframe=all_annotations)