Evals#
LLM Interfaces#
LLM#
- class LLM(*, provider=None, model=None, client=None, initial_per_second_request_rate=None, sync_client_kwargs=None, async_client_kwargs=None, **kwargs)#
Bases: object
An LLM wrapper that simplifies the API for generating text and objects.
This wrapper delegates API access to SDK/client libraries that are installed in the active Python environment. To show supported providers, use show_provider_availability().
The LLM class provides both synchronous and asynchronous methods for all operations.
Examples:
    from phoenix.evals.llm import LLM, show_provider_availability

    show_provider_availability()
    llm = LLM(provider="openai", model="gpt-4o")

    llm.generate_text(prompt="Hello, world!")
    # Example output: "Hello, world!"

    llm.generate_object(
        prompt="Hello, world!",
        schema={
            "type": "object",
            "properties": {"text": {"type": "string"}},
            "required": ["text"],
        },
    )
    # Example output: {"text": "Hello, world!"}
- async async_generate_classification(prompt, labels, include_explanation=True, description=None, **kwargs)#
Asynchronously generate a classification given a prompt and a set of labels.
- Parameters:
prompt (Union[str, List[Dict[str, Any]]]) – The prompt template to go with the tool call.
labels (Union[List[str], Dict[str, str]]) – Either a list of strings, where each string is a label, or a dictionary where keys are labels and values are descriptions.
include_explanation (bool) – Whether to prompt the LLM for an explanation.
description (Optional[str]) – A description of the classification task.
**kwargs – Additional keyword arguments to pass to the LLM SDK.
- Returns:
The generated classification.
- Return type:
Dict[str, Any]
- async async_generate_object(prompt, schema, tracer=None, **kwargs)#
Asynchronously generate an object given a prompt and a schema.
- Parameters:
prompt (Union[str, List[Dict[str, Any]]]) – The prompt to generate the object from.
schema (Dict[str, Any]) – A JSON schema that describes the generated object.
tracer (Optional[Tracer]) – Optional tracer for tracing operations.
**kwargs – Additional keyword arguments to pass to the LLM SDK.
- Returns:
The generated object.
- Return type:
Dict[str, Any]
- async async_generate_text(prompt, tracer=None, **kwargs)#
Asynchronously generate text given a prompt.
- Parameters:
prompt (Union[str, List[Dict[str, Any]]]) – The prompt to generate text from.
tracer (Optional[Tracer]) – The tracer to use for tracing.
**kwargs – Additional keyword arguments to pass to the LLM SDK.
- Returns:
The generated text.
- Return type:
str
- generate_classification(prompt, labels, include_explanation=True, description=None, **kwargs)#
Generate a classification given a prompt and a set of labels.
- Parameters:
prompt (Union[str, List[Dict[str, Any]]]) – The prompt template to go with the tool call.
labels (Union[List[str], Dict[str, str]]) – Either a list of strings, where each string is a label, or a dictionary where keys are labels and values are descriptions.
include_explanation (bool) – Whether to prompt the LLM for an explanation.
description (Optional[str]) – A description of the classification task.
**kwargs – Additional keyword arguments to pass to the LLM SDK.
- Returns:
The generated classification.
- Return type:
Dict[str, Any]
Examples:
    from phoenix.evals import LLM

    llm = LLM(provider="openai", model="gpt-4o", client="openai")

    llm.generate_classification(
        prompt="Hello, world!",
        labels=["yes", "no"],
    )
    # Example output: {"label": "yes", "explanation": "The answer is yes."}

    llm.generate_classification(
        prompt="Hello, world!",
        labels={"yes": "Positive response", "no": "Negative response"},
        include_explanation=False,
    )
    # Example output: {"label": "yes"}
- generate_object(prompt, schema, tracer=None, **kwargs)#
Generate an object given a prompt and a schema.
- Parameters:
prompt (Union[str, List[Dict[str, Any]]]) – The prompt to generate the object from.
schema (Dict[str, Any]) – A JSON schema that describes the generated object.
tracer (Optional[Tracer]) – Optional tracer for tracing operations.
**kwargs – Additional keyword arguments to pass to the LLM SDK.
- Returns:
The generated object.
- Return type:
Dict[str, Any]
- generate_text(prompt, tracer=None, **kwargs)#
Generate text given a prompt.
- Parameters:
prompt (Union[str, List[Dict[str, Any]]]) – The prompt to generate text from.
tracer (Optional[Tracer]) – Optional tracer for tracing operations.
**kwargs – Additional keyword arguments to pass to the LLM SDK.
- Returns:
The generated text.
- Return type:
str
Prompt Template#
- class Template(*, template, template_format=None)#
Bases: object
Template for rendering prompts with mustache ({{variable}}) or f-string ({variable}) formats.
Supports auto-detection of template format and handles JSON content correctly.
Deprecated: Template is deprecated. Use PromptTemplate instead, which supports both string templates and message lists (OpenAI-style format).
- render(variables, tracer=None)#
Render the template with the given variables.
- Parameters:
variables (Dict[str, Any]) – The variables to substitute into the template.
tracer (Optional[Tracer]) – Optional tracer for tracing operations.
- Returns:
The rendered template.
- Return type:
str
- Raises:
TypeError – If variables is not a dictionary.
- property variables#
Get the list of variables used in the template.
- Returns:
A list of variable names found in the template.
- Return type:
List[str]
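To make the two placeholder styles concrete, the following is a minimal sketch of how variable names could be collected from each format using only the standard library. The helpers fstring_variables and mustache_variables are hypothetical names and are not part of phoenix.evals; the sketch does not attempt the format auto-detection or JSON handling that Template provides.

```python
import re
from string import Formatter
from typing import List


def fstring_variables(template: str) -> List[str]:
    """Return field names from a {variable}-style template, in order, without duplicates."""
    names: List[str] = []
    # Formatter.parse yields (literal_text, field_name, format_spec, conversion).
    for _literal, field, _spec, _conversion in Formatter().parse(template):
        if field and field not in names:
            names.append(field)
    return names


def mustache_variables(template: str) -> List[str]:
    """Return names found inside {{ variable }} tags, in order, without duplicates."""
    names: List[str] = []
    for match in re.finditer(r"\{\{\s*([\w.]+)\s*\}\}", template):
        if match.group(1) not in names:
            names.append(match.group(1))
    return names


f_names = fstring_variables("Q: {question}\nA: {answer} ({question})")
m_names = mustache_variables("Hello {{ name }}, regarding {{topic}}")
```

Under these assumptions, each variable is reported once, in the order it first appears in the template.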
Evaluator Abstractions#
Evaluator Base#
- class Evaluator(*, name, kind, direction='maximize', input_schema=None)#
Bases: ABC
Core abstraction for evaluators.
Supports single-record synchronous (evaluate) and asynchronous (async_evaluate) modes with an optional per-call input_mapping.
Note: Subclasses must implement either the _evaluate or _async_evaluate method. Implementing both methods is recommended.
- Parameters:
name – The name of this evaluator, used for identification and Score naming.
kind – The kind of this evaluator (human, llm, or code).
input_schema – Optional Pydantic BaseModel for input typing and validation. If None, subclasses infer fields from prompts or function signatures and may construct a model dynamically.
direction – The direction for score optimization (“maximize” or “minimize”). Defaults to “maximize”.
- async async_evaluate(eval_input, input_mapping=None)#
Async variant of evaluate. Validates and remaps input as described in evaluate.
- Returns:
A list of Score objects.
- bind(input_mapping)#
Binds an evaluator with a fixed input mapping.
- describe()#
Return a JSON-serializable description of the evaluator, including its name, kind, direction, and input fields derived from the Pydantic input schema when available.
- property direction#
The direction for score optimization.
- evaluate(eval_input, input_mapping=None)#
Validate and remap eval_input using the evaluator’s input fields (from input_schema when available, otherwise from the provided input_mapping). An optional per-call input_mapping maps evaluator-required field names to keys/paths in eval_input.
- Returns:
A list of Score objects.
- property input_schema#
Read-only Pydantic input schema for this evaluator, if set.
- property kind#
The kind of this evaluator.
- property name#
The name of this evaluator.
- property source#
The source of this evaluator (deprecated).
- unbind()#
Unbinds an evaluator from an input mapping.
LLMEvaluator#
- class LLMEvaluator(*, name, llm, prompt_template, schema=None, input_schema=None, direction='maximize', **kwargs)#
Bases: Evaluator
Base LLM evaluator that infers required input fields from its prompt template and constructs a default Pydantic input schema when none is supplied.
Note: Subclasses must implement either the _evaluate or _async_evaluate method. Implementing both methods is recommended.
- Parameters:
name – Identifier for this evaluator and the name used in produced Scores.
llm – The LLM instance to use for evaluation.
prompt_template – The prompt template with placeholders for required fields; used to infer required variables. Can be either a string template or a list of message dictionaries (for chat-based models).
schema – Optional tool/JSON schema for structured output when supported by the LLM.
input_schema – Optional Pydantic model describing/validating inputs. If not provided, a model is dynamically created from the prompt variables (all str, required).
direction – The score optimization direction (“maximize” or “minimize”). Defaults to “maximize”.
**kwargs – Invocation parameters forwarded to the LLM client
- async async_evaluate(eval_input, input_mapping=None)#
Async variant of evaluate. Validates and remaps input as described in evaluate.
- Returns:
A list of Score objects.
- evaluate(eval_input, input_mapping=None)#
Validate and remap eval_input using the evaluator’s input fields (from input_schema when available, otherwise from the provided input_mapping). An optional per-call input_mapping maps evaluator-required field names to keys/paths in eval_input.
- Returns:
A list of Score objects.
- property prompt_template#
Get the prompt template.
ClassificationEvaluator#
- class ClassificationEvaluator(*, name, llm, prompt_template, choices, include_explanation=True, input_schema=None, direction='maximize', **kwargs)#
Bases: LLMEvaluator
LLM-based evaluator for classification-style judgements.
Supports label-only or label+score mappings, and returns explanations by default. Note: Requires the LLM to have tool calling or structured output capabilities.
- Parameters:
name – Identifier for this evaluator and the name used in produced Scores.
llm – The LLM instance to use for evaluation. Must support tool calling or structured output for reliable classification.
prompt_template – The prompt template with placeholders for required input fields. Can be either a string template or a list of message dictionaries (for chat-based models). Template variables are inferred automatically.
choices – Classification choices in one of three formats:
a. List[str]: Simple list of label names (e.g., ["positive", "negative"]). Scores will be None.
b. Dict[str, Union[float, int]]: Labels mapped to numeric scores (e.g., {"positive": 1.0, "negative": 0.0}).
c. Dict[str, Tuple[Union[float, int], str]]: Labels mapped to tuples of (score, description) (e.g., {"positive": (1.0, "Positive sentiment"), "negative": (0.0, "Negative sentiment")}). Not recommended as LLMs do not reliably follow this schema.
include_explanation – Whether to request explanations for classification decisions. Defaults to True in accordance with best practices.
input_schema – Optional Pydantic model for input validation. If not provided, a model is automatically created from prompt template variables.
direction – Score optimization direction (“maximize” or “minimize”). Defaults to “maximize”.
**kwargs – Invocation parameters forwarded to the LLM client
- Returns:
A list containing a single Score object with the classification result, including label, optional score, and optional explanation.
- Return type:
List[Score]
Examples
Classification with labels only:
    from phoenix.evals import ClassificationEvaluator, LLM

    evaluator = ClassificationEvaluator(
        name="sentiment",
        llm=LLM(provider="openai", model="gpt-4"),
        prompt_template="Classify the sentiment of this text: {text}",
        choices=["positive", "negative", "neutral"],
    )

    result = evaluator.evaluate({"text": "I love this product!"})
    print(result[0].label)        # "positive"
    print(result[0].explanation)  # LLM's reasoning
    print(result[0].score)        # None
Classification with scores:
    # Map labels to numeric scores
    evaluator = ClassificationEvaluator(
        name="quality",
        llm=llm,
        prompt_template="Rate the quality of this response: {response}",
        choices={
            "excellent": 5,
            "good": 4,
            "fair": 3,
            "poor": 2,
            "terrible": 1,
        },
    )

    result = evaluator.evaluate({"response": "Great explanation with examples"})
    print(result[0].label)  # "excellent"
    print(result[0].score)  # 5
Classification with scores and descriptions (use with caution):
    # Map labels to (score, description) tuples
    evaluator = ClassificationEvaluator(
        name="relevance",
        llm=llm,
        prompt_template=(
            "How relevant is this answer to the question?\n"
            "Question: {question}\nAnswer: {answer}"
        ),
        choices={
            "highly_relevant": (1.0, "Answer directly addresses the question"),
            "somewhat_relevant": (0.5, "Answer partially addresses the question"),
            "not_relevant": (0.0, "Answer does not address the question"),
        },
    )

    result = evaluator.evaluate({
        "question": "What is the capital of France?",
        "answer": "Paris is the capital city of France.",
    })
    print(result[0].label)  # "highly_relevant"
    print(result[0].score)  # 1.0
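The three accepted shapes of choices can be reduced to a single internal form. The sketch below is an illustration only and is not the library's implementation; normalize_choices is a hypothetical helper that maps every label to a (score, description) pair, using None wherever a value was not supplied.

```python
from typing import Dict, List, Optional, Tuple, Union

Number = Union[int, float]
Choices = Union[List[str], Dict[str, Number], Dict[str, Tuple[Number, str]]]
Normalized = Dict[str, Tuple[Optional[Number], Optional[str]]]


def normalize_choices(choices: Choices) -> Normalized:
    """Map each label to (score, description); missing parts become None."""
    if isinstance(choices, list):
        # Labels only: no score and no description.
        return {label: (None, None) for label in choices}
    normalized: Normalized = {}
    for label, value in choices.items():
        if isinstance(value, tuple):
            # (score, description) form.
            score, description = value
            normalized[label] = (score, description)
        else:
            # Plain numeric score.
            normalized[label] = (value, None)
    return normalized


labels_only = normalize_choices(["positive", "negative"])
with_scores = normalize_choices({"positive": 1.0, "negative": 0.0})
with_descriptions = normalize_choices({"positive": (1.0, "Positive sentiment")})
```

With a list input every score is None, which matches the documented behavior that label-only choices produce a Score whose score is None.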
Core Functions#
create_evaluator#
- create_evaluator(name, source=None, direction='maximize', kind=None)#
Decorator that turns a simple function into an Evaluator instance.
The decorated function should accept keyword args matching its required fields and return a value that can be converted to a Score. The returned object is an Evaluator with full support for evaluate/async_evaluate and maintains direct callability.
- Parameters:
name – Identifier for the evaluator and the name used in produced Scores.
kind – The kind of this evaluator (“human”, “llm”, or “code”). Defaults to “code”.
direction – The score optimization direction (“maximize” or “minimize”). Defaults to “maximize”.
Examples
Basic usage with numeric return:
    from phoenix.evals import create_evaluator

    @create_evaluator(name="precision")
    def precision(retrieved_documents: list[int], relevant_documents: list[int]) -> float:
        # Calculate precision for information retrieval
        relevant_set = set(relevant_documents)
        hits = sum(1 for doc in retrieved_documents if doc in relevant_set)
        return hits / len(retrieved_documents) if retrieved_documents else 0.0

    # Use the evaluator
    result = precision.evaluate({
        "retrieved_documents": [1, 2, 3, 4],
        "relevant_documents": [2, 4, 6],
    })
    print(result[0].score)  # 0.5

    # Direct callability maintained:
    result = precision(retrieved_documents=[1, 2, 3, 4], relevant_documents=[2, 4, 6])
    print(result)  # 0.5
Different return types:
    # Boolean return (converted to score and label)
    @create_evaluator(name="is_valid")
    def is_valid(text: str) -> bool:
        return len(text.strip()) > 0

    # Dictionary return with multiple fields
    @create_evaluator(name="positive_sentiment")
    def positive_sentiment(text: str) -> dict:
        # Simplified sentiment analysis
        positive_words = ["good", "great", "excellent"]
        score = sum(1 for word in positive_words if word in text.lower())
        return {
            "score": score / len(positive_words),
            "label": "positive" if score > 0 else "neutral",
            "explanation": f"Found {score} positive indicators",
        }

    # Tuple return (score, label, explanation)
    @create_evaluator(name="length_check")
    def length_check(text: str) -> tuple:
        length = len(text)
        is_good = 10 <= length <= 100
        return (float(is_good), "good" if is_good else "bad", f"Length: {length}")
Using with dataframes:
    import pandas as pd
    from phoenix.evals import create_evaluator, evaluate_dataframe

    @create_evaluator(name="word_count")
    def word_count(text: str) -> int:
        return len(text.split())

    df = pd.DataFrame({
        "text": ["Hello world", "This is a longer sentence", "Short"]
    })

    results_df = evaluate_dataframe(dataframe=df, evaluators=[word_count])
    print(results_df["word_count_score"])  # JSON scores for each row
Notes
The decorated function can return:
A Score object (no conversion needed)
A number (converted to Score.score)
A boolean (converted to integer Score.score and string Score.label)
A short string (≤3 words, converted to Score.label)
A long string (≥4 words, converted to Score.explanation)
A dictionary with keys “score”, “label”, or “explanation”
A tuple of values (only bool, number, str types allowed)
An input_schema is automatically created from the function signature, capturing the required input fields, their types, and any defaults. For best results, do not use *args or **kwargs.
The decorator automatically handles conversion to a valid Score object.
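The conversion rules listed above can be sketched as a small dispatch on the return type. This is an illustration under stated assumptions, not the decorator's actual code: to_score_fields is a hypothetical helper, the label text produced for booleans ("True"/"False") is assumed, and Score objects and tuples are left out for brevity.

```python
from typing import Any, Dict


def to_score_fields(value: Any) -> Dict[str, Any]:
    """Map a raw evaluator return value onto score/label/explanation fields."""
    # bool is checked first because bool is a subclass of int in Python.
    if isinstance(value, bool):
        # Assumption: the label is the string form of the boolean.
        return {"score": int(value), "label": str(value)}
    if isinstance(value, (int, float)):
        return {"score": value}
    if isinstance(value, str):
        # Short strings (3 words or fewer) become labels, longer ones explanations.
        key = "label" if len(value.split()) <= 3 else "explanation"
        return {key: value}
    if isinstance(value, dict):
        allowed = ("score", "label", "explanation")
        return {k: value[k] for k in allowed if k in value}
    raise TypeError(f"cannot convert {type(value).__name__} to score fields")


from_bool = to_score_fields(True)
from_number = to_score_fields(0.5)
from_short_text = to_score_fields("looks fine")
from_long_text = to_score_fields("the answer cites no sources")
```

The ordering of the checks matters: testing for numbers before booleans would silently turn True into a bare score of 1 with no label.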
create_classifier#
- create_classifier(name, prompt_template, llm, choices, direction='maximize')#
Factory to create a ClassificationEvaluator.
Note: The evaluator requires the LLM to have tool calling or structured output capabilities.
- Parameters:
name – Identifier for this evaluator and the name used in produced Scores.
llm – The LLM instance to use for evaluation.
prompt_template – Prompt template string with placeholders for inputs.
choices – One of List[str], Dict[str, number], or Dict[str, Tuple[number, str]] describing classification labels (and optional scores/descriptions).
direction – The score optimization direction (“maximize” or “minimize”). Defaults to “maximize”.
- Returns:
A ClassificationEvaluator instance.
Examples
Creating a simple sentiment classifier:
    from phoenix.evals import create_classifier, LLM

    llm = LLM(provider="openai", model="gpt-4")
    sentiment_evaluator = create_classifier(
        name="sentiment",
        prompt_template="Analyze the sentiment: {text}",
        llm=llm,
        choices=["positive", "negative", "neutral"],
    )

    result = sentiment_evaluator.evaluate({"text": "Great product!"})
    print(result[0].label)  # "positive"
    print(result[0].score)  # None
Creating a classifier with numeric scores:
    quality_evaluator = create_classifier(
        name="response_quality",
        prompt_template="Rate this response quality: {response}",
        llm=llm,
        choices={
            "excellent": 5,
            "good": 4,
            "average": 3,
            "poor": 2,
            "terrible": 1,
        },
    )

    result = quality_evaluator.evaluate({"response": "Detailed and helpful answer"})
    print(f"Quality: {result[0].label} (Score: {result[0].score})")
Creating a classifier with scores and descriptions:
    accuracy_evaluator = create_classifier(
        name="factual_accuracy",
        prompt_template="Check factual accuracy: {claim}",
        llm=llm,
        choices={
            "accurate": (1.0, "Factually correct information"),
            "partially_accurate": (0.5, "Some correct, some incorrect information"),
            "inaccurate": (0.0, "Factually incorrect information"),
        },
    )

    result = accuracy_evaluator.evaluate({"claim": "Paris is the capital of France"})
    print(f"Accuracy: {result[0].label} (Score: {result[0].score})")
bind_evaluator#
- bind_evaluator(evaluator, input_mapping)#
Helper to bind an evaluator with a fixed input mapping.
This function allows you to create a version of an evaluator that automatically maps input data fields to the evaluator’s expected field names. This is useful when your data schema doesn’t match the evaluator’s expected inputs.
- Parameters:
evaluator – The evaluator instance to bind.
input_mapping – A dictionary mapping evaluator field names to either string keys (for direct field mapping) or callable functions (for computed field mapping).
- Returns:
The same evaluator instance with the input mapping bound.
Examples
Basic field mapping:
    from phoenix.evals import create_evaluator, bind_evaluator

    @create_evaluator(name="text_length")
    def text_length(content: str) -> int:
        return len(content)

    # Map 'message' field to 'content' parameter
    mapping = {"content": "message"}
    bound_evaluator = bind_evaluator(evaluator=text_length, input_mapping=mapping)

    # Now we can use 'message' instead of 'content'
    result = bound_evaluator.evaluate({"message": "Hello world"})
    print(result[0].score)  # 11
Using lambda functions for computed mappings:
    @create_evaluator(name="precision")
    def precision(retrieved_docs: list, relevant_docs: list) -> float:
        relevant_set = set(relevant_docs)
        hits = sum(1 for doc in retrieved_docs if doc in relevant_set)
        return hits / len(retrieved_docs) if retrieved_docs else 0.0

    # Convert single document to list format
    mapping = {
        "retrieved_docs": "retrieved_documents",
        "relevant_docs": lambda x: [x["expected_document"]],
    }
    bound_evaluator = bind_evaluator(evaluator=precision, input_mapping=mapping)

    data = {
        "retrieved_documents": [1, 2, 3],
        "expected_document": 2,
    }
    result = bound_evaluator.evaluate(data)
Complex data transformation:
    @create_evaluator(name="response_quality")
    def response_quality(question: str, answer: str, context: str) -> dict:
        # Simplified quality check
        has_context = context.lower() in answer.lower()
        return {
            "score": 1.0 if has_context else 0.0,
            "label": "good" if has_context else "poor",
            "explanation": (
                "Answer uses context" if has_context else "Answer ignores context"
            ),
        }

    # Map nested data structure
    mapping = {
        "question": "query",
        "answer": "response.text",
        "context": lambda x: " ".join(x["documents"]),
    }
    bound_evaluator = bind_evaluator(evaluator=response_quality, input_mapping=mapping)

    data = {
        "query": "What is the capital?",
        "response": {"text": "Paris is the capital of France"},
        "documents": ["France info", "Paris is the capital"],
    }
    result = bound_evaluator.evaluate(data)
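To show what an input mapping does to a record before the evaluator sees it, here is a minimal stand-alone sketch. It is not the library's resolver: resolve_inputs is a hypothetical function, and it assumes that string values are dot-separated paths into nested dictionaries while callables receive the whole record. The real path syntax may support more than this.

```python
from typing import Any, Callable, Dict, Union

InputMapping = Dict[str, Union[str, Callable[[Dict[str, Any]], Any]]]


def resolve_inputs(eval_input: Dict[str, Any], input_mapping: InputMapping) -> Dict[str, Any]:
    """Build the evaluator's keyword inputs from a record and a mapping."""
    resolved: Dict[str, Any] = {}
    for field, source in input_mapping.items():
        if callable(source):
            # Computed mapping: the callable receives the full record.
            resolved[field] = source(eval_input)
        else:
            # Direct mapping: walk a dot-separated path through nested dicts.
            value: Any = eval_input
            for part in source.split("."):
                value = value[part]
            resolved[field] = value
    return resolved


record = {
    "query": "What is the capital?",
    "response": {"text": "Paris is the capital of France"},
    "documents": ["France info", "Paris is the capital"],
}
mapping = {
    "question": "query",
    "answer": "response.text",
    "context": lambda x: " ".join(x["documents"]),
}
resolved = resolve_inputs(record, mapping)
```

The resolved dictionary has exactly the evaluator's field names as keys, which is why a bound evaluator can be applied to data whose schema differs from its signature.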
evaluate_dataframe#
- evaluate_dataframe(dataframe, evaluators, tqdm_bar_format=None, hide_tqdm_bar=False, exit_on_error=None, max_retries=None)#
Evaluate a dataframe with a list of evaluators and return an augmented dataframe.
This function uses a synchronous executor; for async evaluation, use async_evaluate_dataframe.
- Parameters:
dataframe – The input dataframe to evaluate. Each row will be converted to a dict and passed to each evaluator.
evaluators – List of evaluators to apply to each row. Input mapping should be already bound via bind_evaluator or column names should match evaluator input fields.
tqdm_bar_format – Optional format string for the progress bar. If None and hide_tqdm_bar is False, the default progress bar formatter is used.
hide_tqdm_bar – Optional flag to control whether to hide the progress bar. If None, the progress bar is shown. Defaults to False.
exit_on_error – Optional flag to control whether execution should stop on the first error. If None, uses SyncExecutor’s default (True).
max_retries – Optional number of times to retry on exceptions. If None, uses SyncExecutor’s default (10).
- Returns:
A copy of the input dataframe with additional columns for scores and exceptions. For each evaluator, columns are added for:
"{evaluator.name}_execution_details": Details about any exceptions encountered, execution time, and status.
"{score.name}_score": JSON-serialized Score objects for each score returned.
Examples
Basic dataframe evaluation:
    import pandas as pd
    from phoenix.evals import create_evaluator, evaluate_dataframe

    @create_evaluator(name="word_count")
    def word_count(text: str) -> int:
        return len(text.split())

    @create_evaluator(name="has_question")
    def has_question(text: str) -> bool:
        return "?" in text

    df = pd.DataFrame({
        "text": [
            "Hello world",
            "How are you today?",
            "This is a longer sentence with multiple words",
        ]
    })

    evaluators = [word_count, has_question]
    results_df = evaluate_dataframe(dataframe=df, evaluators=evaluators, hide_tqdm_bar=True)

    # Results include original columns plus score columns
    print(results_df.columns)
    # ['text', 'word_count_execution_details', 'has_question_execution_details',
    #  'word_count_score', 'has_question_score']
Using with input mapping:
    from phoenix.evals import bind_evaluator

    @create_evaluator(name="response_length")
    def response_length(response: str) -> int:
        return len(response)

    # Data has 'answer' column but evaluator expects 'response'
    mapping = {"response": "answer"}
    bound_evaluator = bind_evaluator(evaluator=response_length, input_mapping=mapping)

    df = pd.DataFrame({
        "question": ["What is AI?", "How does ML work?"],
        "answer": ["AI is artificial intelligence", "ML uses algorithms to learn patterns"],
    })

    results_df = evaluate_dataframe(dataframe=df, evaluators=[bound_evaluator])
With progress bar and error handling:
    import json

    results_df = evaluate_dataframe(
        dataframe=df,
        evaluators=evaluators,
        tqdm_bar_format="{l_bar}{bar}| {n_fmt}/{total_fmt} [{elapsed}<{remaining}]",
        exit_on_error=False,  # Continue on errors
        max_retries=3,
    )

    # Check for evaluation errors
    for idx, row in results_df.iterrows():
        details = json.loads(row['word_count_execution_details'])
        if details['status'] != 'success':
            print(f"Row {idx} failed: {details['exceptions']}")
Notes
Score name collisions: If multiple evaluators return scores with the same name, they will write to the same column (e.g., ‘same_name_score’). This can lead to data loss as later scores overwrite earlier ones.
Similarly, evaluator names should be unique to ensure execution_details columns don’t collide.
Failed evaluations: If an evaluation fails, the failure details will be recorded in the execution_details column and the score will be None.
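Because colliding names overwrite columns silently, it can help to check for duplicates before running an evaluation. The following is a small sketch using only the standard library; duplicate_names is a hypothetical helper and not part of phoenix.evals. With real evaluators, the names could be gathered from each evaluator's name property.

```python
from collections import Counter
from typing import Iterable, List


def duplicate_names(names: Iterable[str]) -> List[str]:
    """Return, in sorted order, every name that appears more than once."""
    counts = Counter(names)
    return sorted(name for name, count in counts.items() if count > 1)


# Stand-in for [e.name for e in evaluators].
evaluator_names = ["word_count", "sentiment", "word_count"]
clashes = duplicate_names(evaluator_names)
if clashes:
    print(f"Evaluator names used more than once: {clashes}")
```

This check only covers evaluator names. Score names are chosen when each evaluator runs, so collisions between scores returned by different evaluators would still need to be inspected in the results.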
async_evaluate_dataframe#
- async_evaluate_dataframe(dataframe, evaluators, concurrency=None, tqdm_bar_format=None, hide_tqdm_bar=False, exit_on_error=None, max_retries=None)#
Evaluate a dataframe with a list of evaluators and return an augmented dataframe.
This function uses an asynchronous executor; for sync evaluation, use evaluate_dataframe.
- Parameters:
dataframe – The input dataframe to evaluate. Each row will be converted to a dict and passed to each evaluator.
evaluators – List of evaluators to apply to each row. Input mapping should be already bound via bind_evaluator or column names should match evaluator input fields.
concurrency – Optional number of concurrent consumers. If None, uses AsyncExecutor’s default (3).
tqdm_bar_format – Optional format string for the progress bar. If None, use the default formatter.
hide_tqdm_bar – Optional flag to control whether to hide the progress bar. If None, the progress bar is shown. Defaults to False.
exit_on_error – Optional flag to control whether execution should stop on the first error. If None, uses AsyncExecutor’s default (True).
max_retries – Optional number of times to retry on exceptions. If None, uses AsyncExecutor’s default (10).
- Returns:
A copy of the input dataframe with additional columns for scores and exceptions. For each evaluator, columns are added for:
"{evaluator.name}_execution_details": Details about any exceptions encountered, execution time, and status.
"{score.name}_score": JSON-serialized Score objects for each score returned.
Examples
Basic async evaluation:
    import asyncio
    import pandas as pd
    from phoenix.evals import create_evaluator, async_evaluate_dataframe

    @create_evaluator(name="text_analysis")
    def text_analysis(text: str) -> dict:
        return {
            "score": len(text.split()),
            "label": "long" if len(text) > 50 else "short",
        }

    df = pd.DataFrame({
        "text": [
            "Short text",
            "This is a much longer text that contains many more words and characters",
            "Medium length text here",
        ]
    })

    async def main():
        results_df = await async_evaluate_dataframe(
            dataframe=df,
            evaluators=[text_analysis],
            concurrency=5,  # Process up to 5 rows concurrently
            hide_tqdm_bar=True,
        )
        return results_df

    results_df = asyncio.run(main())
    print(results_df.columns)
With LLM evaluators:
    from phoenix.evals import create_classifier, LLM

    llm = LLM(provider="openai", model="gpt-4")
    sentiment_evaluator = create_classifier(
        name="sentiment",
        prompt_template="Classify sentiment: {text}",
        llm=llm,
        choices=["positive", "negative", "neutral"],
    )

    df = pd.DataFrame({
        "text": [
            "I love this product!",
            "This is terrible quality",
            "It's okay, nothing special",
        ]
    })

    async def evaluate_sentiment():
        results_df = await async_evaluate_dataframe(
            dataframe=df,
            evaluators=[sentiment_evaluator],
            concurrency=2,  # Limit concurrent LLM calls
            tqdm_bar_format="{l_bar}{bar}| {n_fmt}/{total_fmt}",
        )
        return results_df

    results_df = asyncio.run(evaluate_sentiment())
Error handling and retries:
    async def robust_evaluation():
        results_df = await async_evaluate_dataframe(
            dataframe=df,
            evaluators=evaluators,
            concurrency=3,
            exit_on_error=False,  # Continue despite errors
            max_retries=5,  # Retry failed evaluations
            tqdm_bar_format="{l_bar}{bar}| {n_fmt}/{total_fmt} [{elapsed}<{remaining}]",
        )

        # Check for failures
        import json
        failed_rows = []
        for idx, row in results_df.iterrows():
            details = json.loads(row['sentiment_execution_details'])
            if details['status'] != 'success':
                failed_rows.append(idx)

        print(f"Failed evaluations: {len(failed_rows)} out of {len(results_df)}")
        return results_df

    results_df = asyncio.run(robust_evaluation())
Notes
Score name collisions: If multiple evaluators return scores with the same name, they will write to the same column (e.g., ‘same_name_score’). This can lead to data loss as later scores overwrite earlier ones.
Similarly, evaluator names should be unique to ensure execution_details columns don’t collide.
Failed evaluations: If an evaluation fails, the failure details will be recorded in the execution_details column and the score will be None.
Score#
Score#
- class Score(*, name=None, score=None, label=None, explanation=None, metadata=None, direction='maximize', kind=None)#
Bases: object
Represents the result of an evaluation.
A Score contains the evaluation result along with metadata about the evaluation. It can represent numeric scores, categorical labels, explanations, or combinations thereof.
Examples
Creating different types of scores:
    from phoenix.evals.evaluators import Score

    # Numeric score only
    numeric_score = Score(
        name="accuracy",
        score=0.85,
        kind="llm",
        direction="maximize",
    )

    # Label only (categorical)
    label_score = Score(
        name="sentiment",
        label="positive",
        kind="llm",
        direction="maximize",
    )

    # Score with explanation
    detailed_score = Score(
        name="relevance",
        score=0.9,
        label="highly_relevant",
        explanation="The answer directly addresses all aspects of the question",
        metadata={"model": "gpt-4", "confidence": 0.95},
        kind="llm",
        direction="maximize",
    )

    # Boolean evaluation
    boolean_score = Score(
        name="has_citation",
        score=1.0,
        label="true",
        explanation="Found 3 citations in the text",
        kind="code",
        direction="maximize",
    )
- pretty_print(indent=2)#
Pretty print the Score as formatted JSON.
- Parameters:
indent – Number of spaces for indentation. Defaults to 2.
- property source#
The source of this score (deprecated).
- to_dict()#
Convert the Score to a dictionary, excluding None values.
- Returns:
A dictionary representation of the Score with None values excluded.
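The effect of excluding None values can be illustrated with a stand-in dataclass. ScoreSketch below is hypothetical and only mirrors the constructor fields shown in the Score signature; it is not the real class, and the real to_dict may differ in details such as key order or how nested metadata is handled.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class ScoreSketch:
    """Stand-in with the same field names as the documented Score signature."""

    name: Optional[str] = None
    score: Optional[float] = None
    label: Optional[str] = None
    explanation: Optional[str] = None
    metadata: Optional[Dict[str, Any]] = None
    direction: str = "maximize"
    kind: Optional[str] = None

    def to_dict(self) -> Dict[str, Any]:
        # Keep only the fields that were actually set.
        return {key: value for key, value in asdict(self).items() if value is not None}


label_only = ScoreSketch(name="sentiment", label="positive", kind="llm").to_dict()
```

Here the unset score, explanation, and metadata fields are absent from the result rather than present with a None value, which keeps serialized scores compact.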
Built-in Metrics#
- class ConcisenessEvaluator(llm, **kwargs)#
Bases: ClassificationEvaluator
An evaluator for assessing whether model outputs are concise and free of unnecessary content.
- Parameters:
llm (LLM) – The LLM instance to use for the evaluation.
**kwargs – Additional invocation parameters forwarded to the LLM client (e.g., temperature=0.0, max_tokens=256).
Notes
Evaluates whether the output to an input is concise or verbose.
Returns one Score with label (concise or verbose), score (1.0 if concise, 0.0 if verbose), and an explanation from the LLM judge.
Requires an LLM that supports tool calling or structured output.
Examples:
    from phoenix.evals.metrics.conciseness import ConcisenessEvaluator
    from phoenix.evals import LLM

    llm = LLM(provider="openai", model="gpt-4o-mini")

    # Default usage
    conciseness_eval = ConcisenessEvaluator(llm=llm)

    # With custom invocation parameters
    conciseness_eval = ConcisenessEvaluator(llm=llm, temperature=0.0)

    eval_input = {
        "input": "What is the capital of France?",
        "output": "Paris.",
    }
    scores = conciseness_eval.evaluate(eval_input)
    print(scores)
    # Example output:
    # [Score(name='conciseness', score=1.0, label='concise',
    #        explanation='The response directly answers the question with no extra words.',
    #        metadata={'model': 'gpt-4o-mini'}, kind="llm", direction="maximize")]
- CHOICES = {'concise': 1.0, 'verbose': 0.0}#
- class ConcisenessInputSchema(*, input, output)#
Bases: BaseModel
- input#
- model_config = {}#
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- output#
- DIRECTION = 'maximize'#
- NAME = 'conciseness'#
- PROMPT = <phoenix.evals.llm.prompts.PromptTemplate object>#
- class CorrectnessEvaluator(llm, **kwargs)#
Bases: ClassificationEvaluator
An evaluator for assessing factual accuracy and completeness of model outputs.
- Parameters:
llm (LLM) – The LLM instance to use for the evaluation.
**kwargs – Additional invocation parameters forwarded to the LLM client (e.g., temperature=0.0, max_tokens=256).
Notes
Evaluates whether the output to an input is correct or incorrect.
Returns one Score with label (correct or incorrect), score (1.0 if correct, 0.0 if incorrect), and an explanation from the LLM judge.
Requires an LLM that supports tool calling or structured output.
Examples:
    from phoenix.evals.metrics.correctness import CorrectnessEvaluator
    from phoenix.evals import LLM

    llm = LLM(provider="openai", model="gpt-4o-mini")

    # Default usage
    correctness_eval = CorrectnessEvaluator(llm=llm)

    # With custom invocation parameters
    correctness_eval = CorrectnessEvaluator(llm=llm, temperature=0.0)

    eval_input = {
        "input": "What is the capital of France?",
        "output": "Paris is the capital of France.",
    }
    scores = correctness_eval.evaluate(eval_input)
    print(scores)
    # Example output:
    # [Score(name='correctness', score=1.0, label='correct',
    #        explanation='The response accurately states that Paris is the capital of France.',
    #        metadata={'model': 'gpt-4o-mini'}, kind="llm", direction="maximize")]
- CHOICES = {'correct': 1.0, 'incorrect': 0.0}#
- class CorrectnessInputSchema(*, input, output)#
Bases: BaseModel
- input#
- model_config = {}#
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- output#
- DIRECTION = 'maximize'#
- NAME = 'correctness'#
- PROMPT = <phoenix.evals.llm.prompts.PromptTemplate object>#
- class DocumentRelevanceEvaluator(llm, **kwargs)#
Bases: ClassificationEvaluator
A specialized evaluator for determining document relevance to a given question.
- Parameters:
llm (LLM) – The LLM instance to use for the evaluation.
**kwargs – Additional invocation parameters forwarded to the LLM client (e.g., temperature=0.0, max_tokens=256).
Notes
Evaluates whether a document contains information relevant to answering a specific question.
Returns one Score with label (relevant or unrelated), score (1.0 if relevant, 0.0 if unrelated), and an explanation from the LLM judge.
Requires an LLM that supports tool calling or structured output.
Examples:
from phoenix.evals.metrics.document_relevance import DocumentRelevanceEvaluator
from phoenix.evals import LLM

llm = LLM(provider="openai", model="gpt-4o-mini")

# Default usage
relevance_eval = DocumentRelevanceEvaluator(llm=llm)

# With custom invocation parameters
relevance_eval = DocumentRelevanceEvaluator(llm=llm, temperature=0.0)

eval_input = {
    "input": "What is the capital of France?",
    "document_text": "Paris is the capital and largest city of France"
}
scores = relevance_eval.evaluate(eval_input)
print(scores)
- CHOICES = {'relevant': 1.0, 'unrelated': 0.0}#
- DIRECTION = 'maximize'#
- class DocumentRelevanceInputSchema(*, input, document_text)#
Bases: BaseModel
- document_text#
- input#
- model_config = {}#
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- NAME = 'document_relevance'#
- PROMPT = <phoenix.evals.llm.prompts.PromptTemplate object>#
- class FaithfulnessEvaluator(llm, **kwargs)#
Bases: ClassificationEvaluator
A specialized evaluator for detecting faithfulness in grounded LLM responses.
- Parameters:
llm (LLM) – The LLM instance to use for the evaluation.
**kwargs – Additional invocation parameters forwarded to the LLM client (e.g., temperature=0.0, max_tokens=256).
Notes
Evaluates whether the output to an input is faithful or unfaithful based on the context.
Returns one Score with label (faithful or unfaithful), score (1.0 if faithful, 0.0 if unfaithful), and an explanation from the LLM judge.
Requires an LLM that supports tool calling or structured output.
Examples:
from phoenix.evals.metrics.faithfulness import FaithfulnessEvaluator
from phoenix.evals import LLM

llm = LLM(provider="openai", model="gpt-4o-mini")

# Default usage
faithfulness_eval = FaithfulnessEvaluator(llm=llm)

# With custom invocation parameters
faithfulness_eval = FaithfulnessEvaluator(llm=llm, temperature=0.0)

eval_input = {
    "input": "What is the capital of France?",
    "output": "Paris is the capital of France.",
    "context": "Paris is the capital and largest city of France."
}
scores = faithfulness_eval.evaluate(eval_input)
print(scores)
# [Score(name='faithfulness', score=1.0, label='faithful', explanation='Information is supported by context', metadata={'model': 'gpt-4o-mini'}, kind="llm", direction="maximize")]
- CHOICES = {'faithful': 1.0, 'unfaithful': 0.0}#
- DIRECTION = 'maximize'#
- class FaithfulnessInputSchema(*, input, output, context)#
Bases: BaseModel
- context#
- input#
- model_config = {}#
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- output#
- NAME = 'faithfulness'#
- PROMPT = <phoenix.evals.llm.prompts.PromptTemplate object>#
- class HallucinationEvaluator(llm, **kwargs)#
Bases: ClassificationEvaluator
A specialized evaluator for detecting hallucinations in grounded LLM responses.
Deprecated: HallucinationEvaluator is deprecated. Please use FaithfulnessEvaluator instead. The new evaluator uses 'faithful'/'unfaithful' labels and maximizes score (1.0 = faithful).
- Parameters:
llm (LLM) – The LLM instance to use for the evaluation.
**kwargs – Additional invocation parameters forwarded to the LLM client (e.g., temperature=0.0, max_tokens=256).
Notes
Evaluates whether the output to an input is factual or hallucinated based on the context.
Returns one Score with label (factual or hallucinated), score (1.0 if hallucinated, 0.0 if factual), and an explanation from the LLM judge.
Requires an LLM that supports tool calling or structured output.
Examples:
from phoenix.evals.metrics.hallucination import HallucinationEvaluator
from phoenix.evals import LLM

llm = LLM(provider="openai", model="gpt-4o-mini")

# Default usage
hallucination_eval = HallucinationEvaluator(llm=llm)

# With custom invocation parameters
hallucination_eval = HallucinationEvaluator(llm=llm, temperature=0.0)

eval_input = {
    "input": "What is the capital of France?",
    "output": "Paris is the capital of France.",
    "context": "Paris is the capital and largest city of France."
}
scores = hallucination_eval.evaluate(eval_input)
print(scores)
# [Score(name='hallucination', score=0.0, label='factual', explanation='Information is supported by context', metadata={'model': 'gpt-4o-mini'}, kind="llm", direction="minimize")]
- CHOICES = {'factual': 0.0, 'hallucinated': 1.0}#
- DIRECTION = 'minimize'#
- class HallucinationInputSchema(*, input, output, context)#
Bases: BaseModel
- context#
- input#
- model_config = {}#
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- output#
- NAME = 'hallucination'#
- PROMPT = <phoenix.evals.llm.prompts.PromptTemplate object>#
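When migrating stored results from the deprecated evaluator, the two CHOICES tables suggest a direct label correspondence. The sketch below is an illustration only: the pairing of 'factual' with 'faithful' and 'hallucinated' with 'unfaithful' is an assumption inferred from the score directions documented here, and `to_faithfulness` is a hypothetical helper, not part of phoenix.evals.

```python
from typing import Tuple

# Copied from the CHOICES attributes documented for each evaluator.
HALLUCINATION_CHOICES = {"factual": 0.0, "hallucinated": 1.0}
FAITHFULNESS_CHOICES = {"faithful": 1.0, "unfaithful": 0.0}

# Assumed correspondence between old and new labels (not stated by the library).
LABEL_MAP = {"factual": "faithful", "hallucinated": "unfaithful"}


def to_faithfulness(hallucination_label: str) -> Tuple[str, float]:
    """Translate a hallucination label into a (faithfulness label, score) pair."""
    if hallucination_label not in HALLUCINATION_CHOICES:
        raise ValueError(f"Unknown hallucination label: {hallucination_label!r}")
    new_label = LABEL_MAP[hallucination_label]
    return new_label, FAITHFULNESS_CHOICES[new_label]


converted = to_faithfulness("factual")
```

Under this assumed mapping, the score flips from a minimize direction (0.0 is best) to a maximize direction (1.0 is best).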
- class MatchesRegex(pattern, name=None, include_explanation=True)#
Bases: Evaluator
Evaluates whether text output matches a specified regular expression pattern.
This code evaluator checks if the output contains one or more substrings that match a given regex pattern. It returns a binary score (1.0 for match, 0.0 for no match) along with an explanation of which substrings matched or that no match was found.
- Parameters:
pattern – The regular expression pattern to match against. Can be provided as a string or a compiled Pattern object.
name – Optional custom name for the evaluator. If not provided, defaults to "matches_regex".
include_explanation – Whether to include an explanation in the Score object. Defaults to True.
Examples
Basic usage with URL detection:
import re
from phoenix.evals.metrics.matches_regex import MatchesRegex

# Compiled regex pattern
pattern = re.compile(r"https?://[^\s]+")
contains_link = MatchesRegex(pattern=pattern)
eval_input = {"output": "Check out https://github.com/Arize-ai/phoenix!"}
scores = contains_link.evaluate(eval_input)
print(scores)
# [Score(name='matches_regex',
#        score=1.0,
#        label=None,
#        explanation='There are 1 matches for the regex: https?://[^\s]+',
#        metadata={},
#        kind='code',
#        direction='maximize')]
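The binary scoring described above can be approximated with the standard library alone. This is a minimal sketch, not the library's implementation: `regex_match_score` is a hypothetical helper, and it assumes that a "match" means at least one non-overlapping substring found by `re.finditer`.

```python
import re
from typing import Pattern, Tuple, Union


def regex_match_score(pattern: Union[str, Pattern[str]], text: str) -> Tuple[float, str]:
    """Return (score, explanation): 1.0 if any substring matches, else 0.0."""
    # Accept either a pattern string or an already compiled pattern.
    compiled = re.compile(pattern) if isinstance(pattern, str) else pattern
    matches = [m.group(0) for m in compiled.finditer(text)]
    if matches:
        return 1.0, f"{len(matches)} match(es) for {compiled.pattern!r}: {matches}"
    return 0.0, f"No match for {compiled.pattern!r}"


score, explanation = regex_match_score(
    r"https?://[^\s]+", "Check out https://github.com/Arize-ai/phoenix!"
)
no_match_score, _ = regex_match_score(re.compile(r"\d+"), "no digits here")
```

The explanation wording here is illustrative and may differ from what MatchesRegex produces.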
- class PrecisionRecallFScore(*, beta=1.0, average='macro', zero_division=0.0, positive_label=None)#
Bases: Evaluator
Code evaluator that computes precision, recall, and F-beta score given lists of expected and output labels.
- Parameters:
beta – Weight of recall relative to precision. Must be > 0. Defaults to 1.0 (F1).
average – Aggregation strategy across classes. One of {'macro', 'micro', 'weighted'}. Defaults to 'macro'. Suffixes are only appended to metric names when a non-default average is used.
positive_label – When set, compute binary precision/recall/F exclusively for this label (one-vs-rest). If None and labels are numeric with unique set {0, 1}, the positive label defaults to 1. Otherwise, multi-class averaging is used.
zero_division – Value to use when a metric is undefined (e.g., 0/0). Defaults to 0.0.
eval_input (Mapping[str, Any]) – Two lists of hashable labels: expected (Sequence[Hashable]), the expected/true sequence of labels, and output (Sequence[Hashable]), the output/predicted sequence of labels.
- Returns:
A list of Score objects.
- Return type:
List[Score]
- Raises:
ValueError – If input validation fails.
Notes
Supports labels as strings or integers (must be hashable).
Supports both binary and multi-class classification via averaging strategies.
- Score Naming:
Defaults (beta=1.0, average="macro"): names are precision, recall, and f1.
Non-default average: e.g., precision_micro, recall_weighted, f0_5_micro.
Examples
Multi-class (macro):
evaluator = PrecisionRecallFScore(beta=1.0, average="macro")
eval_input = {"expected": ["cat", "dog", "cat", "bird"], "output": ["cat", "cat", "cat", "bird"]}
scores = evaluator(eval_input)
[s.name for s in scores]
# ['precision', 'recall', 'f1']
Binary with explicit positive label:
evaluator = PrecisionRecallFScore(beta=0.5, positive_label="spam")
eval_input = {"expected": ["spam", "ham", "spam"], "output": ["spam", "spam", "ham"]}
scores = evaluator(eval_input)
[s.name for s in scores]
# ['precision', 'recall', 'f0_5']
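To make the macro-averaged numbers concrete, the following sketch computes them by hand for the multi-class example above. It is not the library's implementation: `macro_precision_recall_fbeta` is a hypothetical helper, and it assumes the class set is the union of expected and output labels and that F-beta is (1 + beta^2) * P * R / (beta^2 * P + R), with zero_division substituted whenever a denominator is zero.

```python
from typing import Hashable, Sequence, Tuple


def macro_precision_recall_fbeta(
    expected: Sequence[Hashable],
    output: Sequence[Hashable],
    beta: float = 1.0,
    zero_division: float = 0.0,
) -> Tuple[float, float, float]:
    """Unweighted mean of per-class precision, recall, and F-beta."""
    if len(expected) != len(output) or not expected:
        raise ValueError("expected and output must be non-empty and equal in length")
    labels = set(expected) | set(output)  # assumed class set
    pairs = list(zip(expected, output))
    precisions, recalls, fscores = [], [], []
    for label in labels:
        tp = sum(1 for e, o in pairs if e == label and o == label)
        fp = sum(1 for e, o in pairs if e != label and o == label)
        fn = sum(1 for e, o in pairs if e == label and o != label)
        p = tp / (tp + fp) if tp + fp else zero_division
        r = tp / (tp + fn) if tp + fn else zero_division
        denom = beta**2 * p + r
        f = (1 + beta**2) * p * r / denom if denom else zero_division
        precisions.append(p)
        recalls.append(r)
        fscores.append(f)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(fscores) / n


precision, recall, f1 = macro_precision_recall_fbeta(
    ["cat", "dog", "cat", "bird"], ["cat", "cat", "cat", "bird"]
)
```

In this example "dog" is never predicted, so its precision is undefined and falls back to zero_division, which pulls the macro averages down.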
- class RefusalEvaluator(llm, **kwargs)#
Bases: ClassificationEvaluator
An evaluator for detecting when an LLM refuses or declines to answer a query.
- Parameters:
llm (LLM) – The LLM instance to use for the evaluation.
**kwargs – Additional invocation parameters forwarded to the LLM client (e.g., temperature=0.0, max_tokens=256).
Notes
Detects refusals, deflections, scope disclaimers, and non-answers.
Returns one Score with label (refused or answered), score (1.0 if refused, 0.0 if answered), and an explanation from the LLM judge.
This metric is use-case agnostic: it only detects whether a refusal occurred, not whether the refusal was appropriate.
Requires an LLM that supports tool calling or structured output.
Examples:
from phoenix.evals.metrics.refusal import RefusalEvaluator
from phoenix.evals import LLM

llm = LLM(provider="openai", model="gpt-4o-mini")

# Default usage
refusal_eval = RefusalEvaluator(llm=llm)

# With custom invocation parameters
refusal_eval = RefusalEvaluator(llm=llm, temperature=0.0)

eval_input = {
    "input": "What is the capital of France?",
    "output": "I'm sorry, I can only help with technical questions.",
}
scores = refusal_eval.evaluate(eval_input)
print(scores)
# [Score(name='refusal', score=1.0, label='refused', explanation='The response refuses to answer by claiming scope limitations.', metadata={'model': 'gpt-4o-mini'}, kind="llm", direction="neutral")]
- CHOICES = {'answered': 0.0, 'refused': 1.0}#
- DIRECTION = 'neutral'#
- NAME = 'refusal'#
- PROMPT = <phoenix.evals.llm.prompts.PromptTemplate object>#
- class ToolInvocationEvaluator(llm, **kwargs)#
Bases: ClassificationEvaluator
Determines if a tool was invoked correctly with proper arguments, formatting, and safe content.
- Parameters:
llm (LLM) – The LLM instance to use for the evaluation.
**kwargs – Additional invocation parameters forwarded to the LLM client (e.g., temperature=0.0, max_tokens=256).
Notes
Evaluates whether an AI agent’s tool invocation was correct or incorrect based on the conversation context, available tool schemas, and the agent’s tool invocation(s).
This metric evaluates the correctness of the tool invocation (arguments, formatting, safety), not the correctness of the tool selection itself.
Returns one Score with label (correct or incorrect), score (1.0 if correct, 0.0 if incorrect), and an explanation from the LLM judge.
Requires an LLM that supports tool calling or structured output.
- Criteria for Correct Invocation:
JSON is properly structured (if applicable).
All required fields/parameters are present.
No hallucinated or nonexistent fields (all fields exist in the tool schema).
Argument values match the user query and schema expectations.
No unsafe content (e.g., PII) in arguments.
- Criteria for Incorrect Invocation:
Hallucinated or nonexistent fields not in the schema.
Missing required fields/parameters.
Improperly formatted or malformed JSON.
Incorrect, hallucinated, or mismatched argument values.
Unsafe content (e.g., PII, sensitive data) in arguments.
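Two of the criteria above (missing required fields and fields absent from the schema) are mechanical enough to illustrate in plain Python. The sketch below is only an illustration of those criteria, not how the evaluator works: the evaluator relies on an LLM judge, and `check_tool_arguments` is a hypothetical helper that assumes a JSON-schema-style tool definition with `parameters.properties` and `parameters.required`.

```python
from typing import Any, Dict, List


def check_tool_arguments(tool_schema: Dict[str, Any], arguments: Dict[str, Any]) -> List[str]:
    """List structural problems with a tool call's arguments (empty list if none)."""
    parameters = tool_schema.get("parameters", {})
    properties = parameters.get("properties", {})
    required = parameters.get("required", [])
    problems = []
    # Criterion: all required fields/parameters are present.
    for name in required:
        if name not in arguments:
            problems.append(f"missing required field: {name}")
    # Criterion: no fields that do not exist in the tool schema.
    for name in arguments:
        if name not in properties:
            problems.append(f"field not in schema: {name}")
    return problems


book_flight_schema = {
    "name": "book_flight",
    "parameters": {
        "type": "object",
        "properties": {
            "origin": {"type": "string"},
            "destination": {"type": "string"},
            "date": {"type": "string"},
        },
        "required": ["origin", "destination", "date"],
    },
}

ok = check_tool_arguments(
    book_flight_schema, {"origin": "NYC", "destination": "LA", "date": "2024-01-15"}
)
bad = check_tool_arguments(book_flight_schema, {"origin": "NYC", "seat": "12A"})
```

Criteria such as whether argument values match the user query or contain unsafe content need judgment and are not covered by a structural check like this one.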
Examples:
from phoenix.evals.metrics.tool_invocation import ToolInvocationEvaluator
from phoenix.evals import LLM

llm = LLM(provider="openai", model="gpt-4o-mini")

# Default usage
tool_invocation_eval = ToolInvocationEvaluator(llm=llm)

# With custom invocation parameters
tool_invocation_eval = ToolInvocationEvaluator(llm=llm, temperature=0.0)

# Example with JSON schema format for available tools
eval_input = {
    "input": "User: Book a flight from NYC to LA for tomorrow",
    "available_tools": '''
    {
        "name": "book_flight",
        "description": "Book a flight between two cities",
        "parameters": {
            "type": "object",
            "properties": {
                "origin": {"type": "string", "description": "Departure city code"},
                "destination": {"type": "string", "description": "Arrival city code"},
                "date": {"type": "string", "description": "Flight date in YYYY-MM-DD"}
            },
            "required": ["origin", "destination", "date"]
        }
    }
    ''',
    "tool_selection": '''
    book_flight(origin="NYC", destination="LA", date="2024-01-15")
    '''
}
scores = tool_invocation_eval.evaluate(eval_input)
print(scores)

# Example with human-readable format for available tools
eval_input_readable = {
    "input": "User: What's the weather in San Francisco?",
    "available_tools": '''
    WeatherTool:
      Description: Get the current weather for a location
      Parameters:
        - location (required): The city name or coordinates
        - units (optional): Temperature units (celsius or fahrenheit)
    ''',
    "tool_selection": "WeatherTool(location='San Francisco', units='fahrenheit')"
}
scores = tool_invocation_eval.evaluate(eval_input_readable)
print(scores)
- CHOICES = {'correct': 1.0, 'incorrect': 0.0}#
- DIRECTION = 'maximize'#
- NAME = 'tool_invocation'#
- PROMPT = <phoenix.evals.llm.prompts.PromptTemplate object>#
- class ToolResponseHandlingEvaluator(llm, **kwargs)#
Bases: ClassificationEvaluator
Determines if an AI agent properly handled a tool’s response, including error handling, data extraction, transformation, and safe information disclosure.
- Parameters:
llm (LLM) – The LLM instance to use for the evaluation.
**kwargs – Additional invocation parameters forwarded to the LLM client (e.g., temperature=0.0, max_tokens=256).
Notes
Evaluates whether an AI agent correctly processed the tool result to produce an appropriate output.
This metric evaluates what happens AFTER the tool returns, NOT whether the right tool was selected (tool_selection) or invoked correctly (tool_invocation).
Returns one Score with label (correct or incorrect), score (1.0 if correct, 0.0 if incorrect), and an explanation from the LLM judge.
Requires an LLM that supports tool calling or structured output.
Examples:
from phoenix.evals.metrics.tool_response_handling import ToolResponseHandlingEvaluator
from phoenix.evals import LLM

llm = LLM(provider="openai", model="gpt-4o-mini")

# Default usage
tool_response_eval = ToolResponseHandlingEvaluator(llm=llm)

# With custom invocation parameters
tool_response_eval = ToolResponseHandlingEvaluator(llm=llm, temperature=0.0)

# Example: Correct extraction from tool result
eval_input = {
    "input": "What's the weather in Seattle?",
    "tool_call": 'get_weather(location="Seattle")',
    "tool_result": '{"temperature": 58, "conditions": "cloudy"}',
    "output": "Seattle is currently 58°F and cloudy."
}
scores = tool_response_eval.evaluate(eval_input)
print(scores)

# Example: Hallucinated data (incorrect)
eval_input_hallucinated = {
    "input": "What restaurants are nearby?",
    "tool_call": 'search_restaurants(location="downtown")',
    "tool_result": '{"results": [{"name": "Cafe Luna", "rating": 4.2}]}',
    "output": "I found Cafe Luna and Mario's Italian nearby."
}
scores = tool_response_eval.evaluate(eval_input_hallucinated)
print(scores)  # Should be incorrect - Mario's Italian was hallucinated

# Example: Error handling with retry
eval_input_retry = {
    "input": "Find my recent orders",
    "tool_call": "get_orders(user_id='123')",
    "tool_result": '{"error": "rate_limit_exceeded", "retry_after": 30}',
    "output": "[Retried] Your order (ORD-001) has shipped."
}
scores = tool_response_eval.evaluate(eval_input_retry)
print(scores)
- CHOICES = {'correct': 1.0, 'incorrect': 0.0}#
- DIRECTION = 'maximize'#
- NAME = 'tool_response_handling'#
- PROMPT = <phoenix.evals.llm.prompts.PromptTemplate object>#
- class ToolSelectionEvaluator(llm, **kwargs)#
Bases: ClassificationEvaluator
A specialized evaluator for determining if the correct tool was selected for a given context.
- Parameters:
llm (LLM) – The LLM instance to use for the evaluation.
**kwargs – Additional invocation parameters forwarded to the LLM client (e.g., temperature=0.0, max_tokens=256).
Notes
Evaluates whether an AI agent’s tool selection was correct or incorrect based on the conversation context, available tools, and the agent’s tool invocations.
The agent’s tool selection can be a single tool or a list of tools.
This metric evaluates the correctness of the tool selection, not the correctness of the tool invocations or the tool outputs.
Returns one Score with label (correct or incorrect), score (1.0 if correct, 0.0 if incorrect), and an explanation from the LLM judge.
Requires an LLM that supports tool calling or structured output.
Examples:
from phoenix.evals.metrics.tool_selection import ToolSelectionEvaluator
from phoenix.evals import LLM

llm = LLM(provider="openai", model="gpt-4o-mini")

# Default usage
tool_selection_eval = ToolSelectionEvaluator(llm=llm)

# With custom invocation parameters
tool_selection_eval = ToolSelectionEvaluator(llm=llm, temperature=0.0)

eval_input = {
    "input": "User: What is the weather in San Francisco?",
    "available_tools": (
        "WeatherTool: Get the current weather for a location.\n"
        "NewsTool: Stay connected to global events with our up-to-date news.\n"
        "MusicTool: Create playlists, search for music, and check music trends."
    ),
    "tool_selection": "WeatherTool(location='San Francisco')"  # input args optional
}
scores = tool_selection_eval.evaluate(eval_input)
print(scores)
- CHOICES = {'correct': 1.0, 'incorrect': 0.0}#
- DIRECTION = 'maximize'#
- NAME = 'tool_selection'#
- PROMPT = <phoenix.evals.llm.prompts.PromptTemplate object>#
Utilities#
- default_tqdm_progress_bar_formatter(title)#
Returns a progress bar formatter for use with tqdm.
- Parameters:
title (str) – The title of the progress bar, displayed as a prefix.
- Returns:
A formatter to be passed to the bar_format argument of tqdm.
- Return type:
str
- extract_with_jsonpath(data, path, match_all=False)#
Extract a value from a nested JSON structure using jsonpath-ng.
- Parameters:
data – The input dictionary to be extracted from.
path – The jsonpath to extract from the data.
match_all – If True, return a list of all matches. By default, return only the first match.
- Returns:
The extracted value (can be None).
- Raises:
JsonPathParserError – If the path is not parseable (invalid syntax).
ValueError – If the path is invalid or not found (missing key, index out of bounds, etc).
- remap_eval_input(eval_input, required_fields, input_mapping=None)#
Remap eval_input keys based on required_fields and an optional input_mapping.
- Parameters:
eval_input – The input dictionary to be remapped.
required_fields – The required field names as a set of strings.
input_mapping – Optional mapping from evaluator-required field -> eval_input key.
- Returns:
A dictionary with keys as required_fields and values from eval_input.
- Raises:
ValueError – If a required field is missing in eval_input or has a null/empty value.
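The remapping contract described above can be sketched in a few lines. This is a simplified illustration, not the library's code: `remap_fields` is a hypothetical stand-in that only handles plain key names in the mapping, and it assumes "null/empty" means None or an empty string or collection.

```python
from typing import Any, Dict, Mapping, Optional, Set


def remap_fields(
    eval_input: Mapping[str, Any],
    required_fields: Set[str],
    input_mapping: Optional[Mapping[str, str]] = None,
) -> Dict[str, Any]:
    """Build a dict keyed by required field names, pulling values from eval_input."""
    mapping = input_mapping or {}
    remapped: Dict[str, Any] = {}
    for field in required_fields:
        # Fall back to the field name itself when no mapping entry exists.
        source_key = mapping.get(field, field)
        value = eval_input.get(source_key)
        # Assumed definition of "null/empty": None, "", or an empty collection.
        is_empty = value is None or (hasattr(value, "__len__") and len(value) == 0)
        if is_empty:
            raise ValueError(f"Required field {field!r} is missing or empty (key {source_key!r})")
        remapped[field] = value
    return remapped


row = {"question": "What is the capital of France?", "answer": "Paris."}
remapped = remap_fields(
    row, {"input", "output"}, {"input": "question", "output": "answer"}
)
```

A mapping like this lets a dataset with its own column names feed an evaluator that expects `input` and `output` without renaming the source data.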
- to_annotation_dataframe(dataframe, score_names=None)#
Format scores as annotations for logging to Phoenix.
This function takes the output of evaluate_dataframe and an optional list of score names, and formats the scores for logging to Phoenix. If no score names are provided, all scores are extracted from the dataframe (columns ending in _score). Score, label, explanation, and metadata are extracted from each score column and expanded into separate columns. Annotation name and annotator kind are also added as columns.
- Parameters:
dataframe (pd.DataFrame) – DataFrame returned by (async_)evaluate_dataframe
score_names (List[str]) – Names of the score columns to log (e.g., ["precision", "hallucination"]). If None, all columns ending with _score will be used.
- Returns:
DataFrame with the score column, annotation name, and annotator kind columns for the specified score names.
- Return type:
pd.DataFrame
Examples:
from phoenix.client import Client
from phoenix.evals import evaluate_dataframe
from phoenix.evals.utils import to_annotation_dataframe

client = Client()
results = evaluate_dataframe(df, evaluators)

# Log only hallucination annotations
hallucination_annotations = to_annotation_dataframe(results, ["hallucination"])
client.spans.log_span_annotations_dataframe(dataframe=hallucination_annotations)

# Log all scores as annotations
all_annotations = to_annotation_dataframe(results)
client.spans.log_span_annotations_dataframe(dataframe=all_annotations)