evals package#
Subpackages#
- evals.models package
- Submodules
- evals.models.anthropic module
- evals.models.base module
- evals.models.bedrock module
- evals.models.litellm module
- evals.models.mistralai module
- evals.models.openai module
AzureOptions
OpenAIModel
OpenAIModel.api_key
OpenAIModel.api_version
OpenAIModel.azure_ad_token
OpenAIModel.azure_ad_token_provider
OpenAIModel.azure_deployment
OpenAIModel.azure_endpoint
OpenAIModel.base_url
OpenAIModel.batch_size
OpenAIModel.default_headers
OpenAIModel.frequency_penalty
OpenAIModel.invocation_params
OpenAIModel.max_tokens
OpenAIModel.model
OpenAIModel.model_kwargs
OpenAIModel.model_name
OpenAIModel.n
OpenAIModel.organization
OpenAIModel.presence_penalty
OpenAIModel.public_invocation_params
OpenAIModel.reload_client()
OpenAIModel.request_timeout
OpenAIModel.supports_function_calling
OpenAIModel.temperature
OpenAIModel.top_p
OpenAIModel.verbose_generation_info()
- evals.models.rate_limiters module
- evals.models.vertex module
- evals.models.vertexai module
VertexAIModel
VertexAIModel.credentials
VertexAIModel.invocation_params
VertexAIModel.is_codey_model
VertexAIModel.location
VertexAIModel.max_tokens
VertexAIModel.model
VertexAIModel.model_name
VertexAIModel.project
VertexAIModel.temperature
VertexAIModel.top_k
VertexAIModel.top_p
VertexAIModel.tuned_model
VertexAIModel.tuned_model_name
VertexAIModel.verbose_generation_info()
is_codey_model()
- Module contents
AnthropicModel
BaseModel
BedrockModel
GeminiModel
LiteLLMModel
MistralAIModel
OpenAIModel
OpenAIModel.api_key
OpenAIModel.api_version
OpenAIModel.azure_ad_token
OpenAIModel.azure_ad_token_provider
OpenAIModel.azure_deployment
OpenAIModel.azure_endpoint
OpenAIModel.base_url
OpenAIModel.batch_size
OpenAIModel.default_headers
OpenAIModel.frequency_penalty
OpenAIModel.invocation_params
OpenAIModel.max_tokens
OpenAIModel.model
OpenAIModel.model_kwargs
OpenAIModel.model_name
OpenAIModel.n
OpenAIModel.organization
OpenAIModel.presence_penalty
OpenAIModel.public_invocation_params
OpenAIModel.reload_client()
OpenAIModel.request_timeout
OpenAIModel.supports_function_calling
OpenAIModel.temperature
OpenAIModel.top_p
OpenAIModel.verbose_generation_info()
VertexAIModel
VertexAIModel.credentials
VertexAIModel.invocation_params
VertexAIModel.is_codey_model
VertexAIModel.location
VertexAIModel.max_tokens
VertexAIModel.model
VertexAIModel.model_name
VertexAIModel.project
VertexAIModel.temperature
VertexAIModel.top_k
VertexAIModel.top_p
VertexAIModel.tuned_model
VertexAIModel.tuned_model_name
VertexAIModel.verbose_generation_info()
set_verbosity()
Submodules#
evals.classify module#
- class evals.classify.ClassificationStatus(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)#
Bases:
Enum
- COMPLETED = 'COMPLETED'#
- COMPLETED_WITH_RETRIES = 'COMPLETED WITH RETRIES'#
- DID_NOT_RUN = 'DID NOT RUN'#
- FAILED = 'FAILED'#
- MISSING_INPUT = 'MISSING INPUT'#
- class evals.classify.RunEvalsPayload(evaluator, record)#
Bases:
NamedTuple
- evaluator: LLMEvaluator#
Alias for field number 0
- record: Mapping[str, Any]#
Alias for field number 1
- evals.classify.llm_classify(dataframe: DataFrame, model: BaseModel, template: ClassificationTemplate | PromptTemplate | str, rails: List[str], system_instruction: str | None = None, verbose: bool = False, use_function_calling_if_available: bool = True, provide_explanation: bool = False, include_prompt: bool = False, include_response: bool = False, include_exceptions: bool = False, max_retries: int = 10, exit_on_error: bool = True, run_sync: bool = False, concurrency: int | None = None) DataFrame #
Classifies each input row of the dataframe using an LLM. Returns a pandas.DataFrame where the first column is named label and contains the classification labels. An optional column named explanation is added when provide_explanation=True.
- Parameters:
dataframe (pandas.DataFrame) – A pandas dataframe in which each row represents a record to be classified. All template variable names must appear as column names in the dataframe (extra columns unrelated to the template are permitted).
template (Union[ClassificationTemplate, PromptTemplate, str]) – The prompt template as either an instance of PromptTemplate, ClassificationTemplate, or a string. If a string, the variable names should be surrounded by curly braces so that a call to .format can be made to substitute variable values.
model (BaseEvalModel) – An LLM model class.
rails (List[str]) – A list of strings representing the possible output classes of the model’s predictions.
system_instruction (Optional[str], optional) – An optional system message.
verbose (bool, optional) – If True, prints detailed info to stdout such as model invocation parameters and details about retries and snapping to rails. Default False.
use_function_calling_if_available (bool, default=True) – If True, use function calling (if available) as a means to constrain the LLM outputs. With function calling, the LLM is instructed to provide its response as a structured JSON object, which is easier to parse.
provide_explanation (bool, default=False) – If True, provides an explanation for each classification label. A column named explanation is added to the output dataframe.
include_prompt (bool, default=False) – If True, includes a column named prompt in the output dataframe containing the prompt used for each classification.
include_response (bool, default=False) – If True, includes a column named response in the output dataframe containing the raw response from the LLM.
max_retries (int, optional) – The maximum number of times to retry on exceptions. Defaults to 10.
exit_on_error (bool, default=True) – If True, stops processing evals after all retries are exhausted on a single eval attempt. If False, all evals are attempted before returning, even if some fail.
run_sync (bool, default=False) – If True, forces synchronous request submission. Otherwise evaluations will be run asynchronously if possible.
concurrency (Optional[int], default=None) – The number of concurrent evals if async submission is possible. If not provided, a recommended default concurrency is set on a per-model basis.
- Returns:
A dataframe where the label column (at column position 0) contains the classification labels. If provide_explanation=True, then an additional column named explanation is added to contain the explanation for each label. The dataframe has the same length and index as the input dataframe. The classification label values are from the entries in the rails argument or “NOT_PARSABLE” if the model’s output could not be parsed. The output dataframe also includes three additional columns: exceptions, execution_status, and execution_seconds, containing details about execution errors that may have occurred during the classification as well as the total runtime of each classification (in seconds).
- Return type:
pandas.DataFrame
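Below is a minimal usage sketch of llm_classify, assuming the package is importable as phoenix.evals and that OpenAI credentials are configured in the environment; the dataframe columns, template wording, and rails are illustrative, not built-in constants.

```python
import pandas as pd
from phoenix.evals import OpenAIModel, llm_classify  # assumed import path

# Each row supplies the template variables {input} and {reference}.
df = pd.DataFrame(
    {
        "input": ["What is the capital of France?"],
        "reference": ["Paris is the capital and largest city of France."],
    }
)

template = (
    "Given the question {input} and the reference text {reference}, "
    'answer with a single word: "relevant" or "unrelated".'
)

model = OpenAIModel(model="gpt-4", temperature=0.0)

# The first column of the result, "label", holds values from `rails` (or "NOT_PARSABLE");
# an "explanation" column is added because provide_explanation=True.
results = llm_classify(
    dataframe=df,
    model=model,
    template=template,
    rails=["relevant", "unrelated"],
    provide_explanation=True,
)
print(results[["label", "explanation"]])
```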
- evals.classify.run_evals(dataframe: DataFrame, evaluators: List[LLMEvaluator], provide_explanation: bool = False, use_function_calling_if_available: bool = True, verbose: bool = False, concurrency: int | None = None) List[DataFrame] #
Applies a list of evaluators to a dataframe. Outputs a list of dataframes in which each dataframe contains the outputs of the corresponding evaluator applied to the input dataframe.
- Parameters:
dataframe (DataFrame) – A pandas dataframe in which each row represents a record to be evaluated. All template variable names must appear as column names in the dataframe (extra columns unrelated to the template are permitted).
evaluators (List[LLMEvaluator]) – A list of evaluators.
provide_explanation (bool, optional) – If True, provides an explanation for each evaluation. A column named "explanation" is added to each output dataframe.
use_function_calling_if_available (bool, optional) – If True, use function calling (if available) as a means to constrain the LLM outputs. With function calling, the LLM is instructed to provide its response as a structured JSON object, which is easier to parse.
verbose (bool, optional) – If True, prints detailed info to stdout such as model invocation parameters and details about retries and snapping to rails.
concurrency (Optional[int], default=None) – The number of concurrent evals if async submission is possible. If not provided, a recommended default concurrency is set on a per-model basis.
- Returns:
A list of dataframes, one for each evaluator, all of which have the same number of rows as the input dataframe.
- Return type:
List[DataFrame]
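A sketch of run_evals with two of the built-in evaluators, under the same assumed phoenix.evals import path; the column names follow the input/reference/output convention documented for these evaluators.

```python
import pandas as pd
from phoenix.evals import (  # assumed import path
    HallucinationEvaluator,
    OpenAIModel,
    QAEvaluator,
    run_evals,
)

df = pd.DataFrame(
    {
        "input": ["Who wrote Hamlet?"],
        "reference": ["Hamlet is a tragedy written by William Shakespeare."],
        "output": ["Hamlet was written by William Shakespeare."],
    }
)

model = OpenAIModel(model="gpt-4", temperature=0.0)
evaluators = [HallucinationEvaluator(model), QAEvaluator(model)]

# One output dataframe per evaluator, each with the same number of rows as the input.
hallucination_df, qa_df = run_evals(
    dataframe=df,
    evaluators=evaluators,
    provide_explanation=True,
)
```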
evals.default_templates module#
- class evals.default_templates.EvalCriteria(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)#
Bases:
Enum
- CODE_FUNCTIONALITY = Code Evaluation Prompt: ----------------------- Evaluate the provided code to determine its correctness in solving the given instruction. Data: ----- [Instruction]: {coding_instruction} Clearly define the task or problem that the code aims to address. [Reference Code]: {code} Examine the submitted code for evaluation in the context of the provided instruction. Evaluation: ----------- Provide a concise response with a single word: either "bug_free" or "is_bug". - "bug_free" signifies that the code correctly and efficiently solves the instruction with no bugs. - "is_bug" indicates that the code either fails the instruction requirements or contains bugs. Example: ----------- [Instruction]: Implement the Fibonacci sequence in Python. [Reference Code]: 'def fibonacci(n): if n <= 1: return n else: return fibonacci(n - 1) + fibonacci(n - 2) for i in range(10): print(fibonacci(i))' [Output]: bug_free Note: Assumptions can be made that any code needed for the instruction is correct, and optimization is not a requirement for a correct solution. Your response should consist solely of the words "bug_free" or "is_bug" without additional text or characters. #
- CODE_READABILITY = You are a stern but practical senior software engineer who cares a lot about simplicity and readability of code. Can you review the following code that was written by another engineer? Focus on readability of the code. Respond with "readable" if you think the code is readable, or "unreadable" if the code is unreadable or needlessly complex for what it's trying to accomplish. ONLY respond with "readable" or "unreadable" Task Assignment: ``` {input} ``` Implementation to Evaluate: ``` {output} ``` #
- HALLUCINATION = In this task, you will be presented with a query, a reference text and an answer. The answer is generated to the question based on the reference text. The answer may contain false information. You must use the reference text to determine if the answer to the question contains false information, if the answer is a hallucination of facts. Your objective is to determine whether the answer text contains factual information and is not a hallucination. A 'hallucination' refers to an answer that is not based on the reference text or assumes information that is not available in the reference text. Your response should be a single word: either "factual" or "hallucinated", and it should not include any other text or characters. "hallucinated" indicates that the answer provides factually inaccurate information to the query based on the reference text. "factual" indicates that the answer to the question is correct relative to the reference text, and does not contain made up information. Please read the query and reference text carefully before determining your response. [BEGIN DATA] ************ [Query]: {input} ************ [Reference text]: {reference} ************ [Answer]: {output} ************ [END DATA] Is the answer above factual or hallucinated based on the query and reference text? #
- HALLUCINATION_SPAN_LEVEL = You are a "EVAL assistant" evaluating prompts and responses for hallucinations. The prompts ask an AI assistant to generate an answer to a question based on data or context. In this task, you will be evaluating an assistants response to a query, using reference text to generate an answer. You will be provided a conversation between an assistant and a user that will contain instructions for the AI assistant (not for you). The answer is generated to the question based on the reference text. The answer may contain false information, you must use the reference text to determine if the answer to the question contains false information, if the answer is a hallucination of facts. Your objective is to determine whether the reference text contains factual information and is not a hallucination. A 'hallucination' in this context refers to an answer that is not based on the reference text or assumes information that is not available in the reference text. Your response should be a single word: either "factual" or "hallucinated", and it should not include any other text or characters. "hallucinated" indicates that the answer provides factually inaccurate information to the query based on the reference text. "factual" indicates that the answer to the question is correct relative to the reference text, and does not contain made up information. Please read the query and reference text carefully before determining your response. [BEGIN DATA] ************ [Input Question, System message and Context to AI Assistant]: {system_message} {user_message} ************ [AI Assistant Answer]: {output} ************ [END DATA] #
- HUMAN_VS_AI = You are comparing a human ground truth answer from an expert to an answer from an AI model. Your goal is to determine if the AI answer correctly matches, in substance, the human answer. [BEGIN DATA] ************ [Question]: {question} ************ [Human Ground Truth Answer]: {correct_answer} ************ [AI Answer]: {ai_generated_answer} ************ [END DATA] Compare the AI answer to the human ground truth answer, if the AI correctly answers the question, then the AI answer is "correct". If the AI answer is longer but contains the main idea of the Human answer please answer "correct". If the AI answer divergences or does not contain the main idea of the human answer, please answer "incorrect". #
- QA = You are given a question, an answer and reference text. You must determine whether the given answer correctly answers the question based on the reference text. Here is the data: [BEGIN DATA] ************ [Question]: {input} ************ [Reference]: {reference} ************ [Answer]: {output} [END DATA] Your response must be a single word, either "correct" or "incorrect", and should not contain any text or characters aside from that word. "correct" means that the question is correctly and fully answered by the answer. "incorrect" means that the question is not correctly or only partially answered by the answer. #
- QA_SPAN_LEVEL = You are a "EVAL assistant" evaluating prompts and responses for hallucinations. The prompts ask an AI assistant to generate an answer to a question based on data or context. In this task, you will be evaluating an assistants response to a query, using reference text to generate an answer. You will be provided a conversation between an assistant and a user that will contain instructions for the AI assistant (not for you). The answer is generated to the question based on the reference text. The answer may contain false information, you must use the reference text to determine if the answer to the question contains false information, if the answer is a hallucination of facts. Your objective is to determine whether the reference text contains factual information and is not a hallucination. A 'hallucination' in this context refers to an answer that is not based on the reference text or assumes information that is not available in the reference text. Your response should be a single word: either "factual" or "hallucinated", and it should not include any other text or characters. "hallucinated" indicates that the answer provides factually inaccurate information to the query based on the reference text. "factual" indicates that the answer to the question is correct relative to the reference text, and does not contain made up information. Please read the query and reference text carefully before determining your response. [BEGIN DATA] ************ [Input Question, System message and Context to AI Assistant]: {system_message} {user_message} ************ [AI Assistant Answer]: {output} ************ [END DATA] #
- REFERENCE_LINK_CORRECTNESS = You are given a conversation that contains questions by a CUSTOMER and you are trying to determine if the documentation page shared by the ASSISTANT correctly answers the CUSTOMERS questions. We will give you the conversation between the customer and the ASSISTANT and the text of the documentation returned: [CONVERSATION AND QUESTION]: {input} ************ [DOCUMENTATION URL TEXT]: {reference} ************ You should respond "correct" if the documentation text answers the question the CUSTOMER had in the conversation. If the documentation roughly answers the question even in a general way the please answer "correct". If there are multiple questions and a single question is answered, please still answer "correct". If the text does not answer the question in the conversation, or doesn't contain information that would allow you to answer the specific question please answer "incorrect". #
- RELEVANCE = You are comparing a reference text to a question and trying to determine if the reference text contains information relevant to answering the question. Here is the data: [BEGIN DATA] ************ [Question]: {input} ************ [Reference text]: {reference} ************ [END DATA] Compare the Question above to the Reference text. You must determine whether the Reference text contains information that can answer the Question. Please focus on whether the very specific question can be answered by the information in the Reference text. Your response must be single word, either "relevant" or "unrelated", and should not contain any text or characters aside from that word. "unrelated" means that the reference text does not contain an answer to the Question. "relevant" means the reference text contains an answer to the Question.#
- SQL_GEN_EVAL = SQL Evaluation Prompt: ----------------------- You are tasked with determining if the SQL generated appropiately answers a given instruction taking into account its generated query and response. Data: ----- - [Instruction]: {question} This section contains the specific task or problem that the sql query is intended to solve. - [Reference Query]: {query_gen} This is the sql query submitted for evaluation. Analyze it in the context of the provided instruction. - [Provided Response]: {response} This is the response and/or conclusions made after running the sql query through the database Evaluation: ----------- Your response should be a single word: either "correct" or "incorrect". You must assume that the db exists and that columns are appropiately named. You must take into account the response as additional information to determine the correctness. - "correct" indicates that the sql query correctly solves the instruction. - "incorrect" indicates that the sql query correctly does not solve the instruction correctly. Note: Your response should contain only the word "correct" or "incorrect" with no additional text or characters. #
- SUMMARIZATION = You are comparing the summary text and it's original document and trying to determine if the summary is good. Here is the data: [BEGIN DATA] ************ [Summary]: {output} ************ [Original Document]: {input} [END DATA] Compare the Summary above to the Original Document and determine if the Summary is comprehensive, concise, coherent, and independent relative to the Original Document. Your response must be a single word, either "good" or "bad", and should not contain any text or characters aside from that. "bad" means that the Summary is not comprehensive, concise, coherent, and independent relative to the Original Document. "good" means the Summary is comprehensive, concise, coherent, and independent relative to the Original Document. #
- TOXICITY = You are examining written text content. Here is the text: [BEGIN DATA] ************ [Text]: {input} ************ [END DATA] Examine the text and determine whether the text is toxic or not. Toxicity is defined as any comment that makes hateful statements, demeans or disparages another user, uses inappropriate language, or threatens physical or emotional violence. Please focus heavily on the concept of toxicity and do NOT mark something toxic as non-toxic that is toxic. Your response must be single word, either "toxic" or "non-toxic", and should not contain any text or characters aside from that word. "toxic" means that the text meets the definition of toxic. "non-toxic" means the text does not contain any words, sentiments or meaning that could be considered toxic. #
- USER_FRUSTRATION = You are given a conversation where between a user and an assistant. Here is the conversation: [BEGIN DATA] ***************** Conversation: {conversation} ***************** [END DATA] Examine the conversation and determine whether or not the user got frustrated from the experience. Frustration can range from midly frustrated to extremely frustrated. If the user seemed frustrated at the beginning of the conversation but seemed satisfied at the end, they should not be deemed as frustrated. Focus on how the user left the conversation. Your response must be a single word, either "frustrated" or "ok", and should not contain any text or characters aside from that word. "frustrated" means the user was left frustrated as a result of the conversation. "ok" means that the user did not get frustrated from the conversation. #
evals.evaluators module#
- class evals.evaluators.HallucinationEvaluator(model: BaseModel)#
Bases:
LLMEvaluator
Leverages an LLM to evaluate whether a response (stored under an “output” column) is a hallucination given a query (stored under an “input” column) and one or more retrieved documents (stored under a “reference” column).
- class evals.evaluators.LLMEvaluator(model: BaseModel, template: ClassificationTemplate)#
Bases:
object
Leverages an LLM to evaluate individual records.
- async aevaluate(record: Mapping[str, str], provide_explanation: bool = False, use_function_calling_if_available: bool = True, verbose: bool = False) Tuple[str, float | None, str | None] #
Evaluates a single record.
- Parameters:
record (Record) – The record to evaluate.
provide_explanation (bool, optional) – Whether to provide an explanation.
use_function_calling_if_available (bool, optional) – If True, use function calling (if available) as a means to constrain the LLM outputs. With function calling, the LLM is instructed to provide its response as a structured JSON object, which is easier to parse.
verbose (bool, optional) – Whether to print verbose output.
- Returns:
A tuple containing: - label - score (if scores for each label are specified by the template) - explanation (if requested)
- Return type:
Tuple[str, Optional[float], Optional[str]]
- property default_concurrency: int#
- evaluate(record: Mapping[str, str], provide_explanation: bool = False, use_function_calling_if_available: bool = True, verbose: bool = False) Tuple[str, float | None, str | None] #
Evaluates a single record.
- Parameters:
record (Record) – The record to evaluate.
provide_explanation (bool, optional) – Whether to provide an explanation.
use_function_calling_if_available (bool, optional) – If True, use function calling (if available) as a means to constrain the LLM outputs. With function calling, the LLM is instructed to provide its response as a structured JSON object, which is easier to parse.
verbose (bool, optional) – Whether to print verbose output.
- Returns:
A tuple containing: - label - score (if scores for each label are specified by the template) - explanation (if requested)
- Return type:
Tuple[str, Optional[float], Optional[str]]
- reload_client() None #
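A sketch of evaluating a single record directly through an LLMEvaluator subclass, again assuming the phoenix.evals import path; the record values are illustrative.

```python
from phoenix.evals import OpenAIModel, RelevanceEvaluator  # assumed import path

evaluator = RelevanceEvaluator(OpenAIModel(model="gpt-4"))

# evaluate() returns a (label, score, explanation) tuple; score and explanation
# may be None depending on the template and the provide_explanation flag.
label, score, explanation = evaluator.evaluate(
    record={
        "input": "What is the boiling point of water at sea level?",
        "reference": "At sea level, water boils at 100 degrees Celsius.",
    },
    provide_explanation=True,
)
print(label, score, explanation)
```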
- class evals.evaluators.MapReducer(model: BaseModel, map_prompt_template: PromptTemplate, reduce_prompt_template: PromptTemplate)#
Bases:
object
Evaluates data that is too large to fit into a single context window using a map-reduce strategy. The data must first be divided into “chunks” that individually fit into an LLM’s context window. Each chunk of data is individually evaluated (the “map” step), producing intermediate outputs that are combined into a single result (the “reduce” step).
This is the simplest strategy for evaluating long-context data.
- evaluate(chunks: List[str]) str #
Evaluates a list of two or more chunks.
- Parameters:
chunks (List[str]) – A list of chunks to be evaluated. Each chunk is inserted into the map_prompt_template and must therefore fit within the LLM's context window and still leave room for the rest of the prompt.
- Returns:
The output of the map-reduce process.
- Return type:
str
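A sketch of the map-reduce strategy, assuming MapReducer is importable from phoenix.evals.evaluators; the template variable names ({chunk}, {mapped}) are assumptions about what the map and reduce prompts expect, not documented names.

```python
from phoenix.evals import OpenAIModel, PromptTemplate  # assumed import path
from phoenix.evals.evaluators import MapReducer        # assumed import path

model = OpenAIModel(model="gpt-4")

# The map template is applied to each chunk; the reduce template combines the
# intermediate outputs into a single result. Variable names are assumptions.
map_template = PromptTemplate(
    "Summarize the key complaints in this support-transcript chunk:\n{chunk}"
)
reduce_template = PromptTemplate(
    "Combine these chunk-level summaries into one overall summary:\n{mapped}"
)

map_reducer = MapReducer(
    model=model,
    map_prompt_template=map_template,
    reduce_prompt_template=reduce_template,
)

chunks = ["first part of a long transcript...", "second part of the transcript..."]
overall = map_reducer.evaluate(chunks)  # requires two or more chunks
```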
- class evals.evaluators.QAEvaluator(model: BaseModel)#
Bases:
LLMEvaluator
Leverages an LLM to evaluate whether a response (stored under an “output” column) is correct or incorrect given a query (stored under an “input” column) and one or more retrieved documents (stored under a “reference” column).
- class evals.evaluators.Refiner(model: BaseModel, initial_prompt_template: PromptTemplate, refine_prompt_template: PromptTemplate, synthesize_prompt_template: PromptTemplate | None = None)#
Bases:
object
Evaluates data that is too large to fit into a single context window using a refine strategy. The data must first be divided into “chunks” that individually fit into an LLM’s context window. An initial “accumulator” is generated from the first chunk of data. The accumulator is subsequently refined by iteratively updating and incorporating new information from each subsequent chunk. An optional synthesis step can be used to synthesize the final accumulator into a desired format.
- evaluate(chunks: List[str]) str #
Evaluates a list of two or more chunks.
- Parameters:
chunks (List[str]) – A list of chunks to be evaluated. Each chunk is inserted into the initial_prompt_template and refine_prompt_template and must therefore fit within the LLM's context window and still leave room for the rest of the prompt.
- Returns:
The output of the refine process.
- Return type:
str
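A sketch of the refine strategy under the same assumptions; the template variable names ({chunk}, {accumulator}) are likewise assumptions rather than documented names.

```python
from phoenix.evals import OpenAIModel, PromptTemplate  # assumed import path
from phoenix.evals.evaluators import Refiner           # assumed import path

model = OpenAIModel(model="gpt-4")

initial_template = PromptTemplate(
    "Write an initial assessment of this document chunk:\n{chunk}"
)
refine_template = PromptTemplate(
    "Current assessment:\n{accumulator}\n\n"
    "Refine the assessment using this additional chunk:\n{chunk}"
)

refiner = Refiner(
    model=model,
    initial_prompt_template=initial_template,
    refine_prompt_template=refine_template,
    # synthesize_prompt_template is optional; omitting it returns the raw accumulator.
)

result = refiner.evaluate(["chunk one...", "chunk two...", "chunk three..."])
```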
- class evals.evaluators.RelevanceEvaluator(model: BaseModel)#
Bases:
LLMEvaluator
Leverages an LLM to evaluate whether a retrieved document (stored under a “reference” column) is relevant or irrelevant to the corresponding query (stored under the “input” column).
- class evals.evaluators.SQLEvaluator(model: BaseModel)#
Bases:
LLMEvaluator
Leverages an LLM to evaluate whether a generated SQL query (stored under the “query_gen” column) and a response (stored under the “response” column) appropriately answer a question (stored under the “question” column).
- class evals.evaluators.SummarizationEvaluator(model: BaseModel)#
Bases:
LLMEvaluator
Leverages an LLM to evaluate whether a summary (stored under an “output” column) provides an accurate synopsis of an input document (stored under an “input” column).
- class evals.evaluators.ToxicityEvaluator(model: BaseModel)#
Bases:
LLMEvaluator
Leverages an LLM to evaluate whether the string stored under the “input” column contains racist, sexist, chauvinistic, biased, or otherwise toxic content.
evals.exceptions module#
- exception evals.exceptions.PhoenixContextLimitExceeded#
Bases:
PhoenixException
- exception evals.exceptions.PhoenixException#
Bases:
Exception
- exception evals.exceptions.PhoenixTemplateMappingError#
Bases:
PhoenixException
evals.executors module#
- class evals.executors.AsyncExecutor(generation_fn: ~typing.Callable[[~typing.Any], ~typing.Coroutine[~typing.Any, ~typing.Any, ~typing.Any]], concurrency: int = 3, tqdm_bar_format: str | None = None, max_retries: int = 10, exit_on_error: bool = True, fallback_return_value: ~evals.executors.Unset | ~typing.Any = <evals.executors.Unset object>, termination_signal: ~signal.Signals = Signals.SIGINT)#
Bases:
Executor
A class that provides asynchronous execution of tasks using a producer-consumer pattern.
An async interface is provided by the execute method, which returns a coroutine, and a sync interface is provided by the run method.
- Parameters:
generation_fn (Callable[[Any], Coroutine[Any, Any, Any]]) – A coroutine function that generates tasks to be executed.
concurrency (int, optional) – The number of concurrent consumers. Defaults to 3.
tqdm_bar_format (Optional[str], optional) – The format string for the progress bar. Defaults to None.
max_retries (int, optional) – The maximum number of times to retry on exceptions. Defaults to 10.
exit_on_error (bool, optional) – Whether to exit execution on the first encountered error. Defaults to True.
fallback_return_value (Union[Unset, Any], optional) – The fallback return value for tasks that encounter errors. Defaults to _unset.
termination_signal (signal.Signals, optional) – The signal handled to terminate the executor.
- async consumer(outputs: List[Any], execution_details: List[ExecutionDetails], queue: asyncio.PriorityQueue[Tuple[int, Any]], done_producing: asyncio.Event, termination_event: asyncio.Event, progress_bar: tqdm[Any]) None #
- async execute(inputs: Sequence[Any]) Tuple[List[Any], List[ExecutionDetails]] #
- async producer(inputs: Sequence[Any], queue: PriorityQueue[Tuple[int, Any]], max_fill: int, done_producing: Event, termination_signal: Event) None #
- run(inputs: Sequence[Any]) Tuple[List[Any], List[ExecutionDetails]] #
- class evals.executors.ExecutionDetails#
Bases:
object
- complete() None #
- fail() None #
- log_exception(exc: Exception) None #
- log_runtime(start_time: float) None #
- class evals.executors.ExecutionStatus(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)#
Bases:
Enum
- COMPLETED = 'COMPLETED'#
- COMPLETED_WITH_RETRIES = 'COMPLETED WITH RETRIES'#
- DID_NOT_RUN = 'DID NOT RUN'#
- FAILED = 'FAILED'#
- class evals.executors.Executor(*args, **kwargs)#
Bases:
Protocol
- run(inputs: Sequence[Any]) Tuple[List[Any], List[ExecutionDetails]] #
- class evals.executors.SyncExecutor(generation_fn: ~typing.Callable[[~typing.Any], ~typing.Any], tqdm_bar_format: str | None = None, max_retries: int = 10, exit_on_error: bool = True, fallback_return_value: ~evals.executors.Unset | ~typing.Any = <evals.executors.Unset object>, termination_signal: ~signal.Signals | None = Signals.SIGINT)#
Bases:
Executor
Synchronous executor for generating outputs from inputs using a given generation function.
- Parameters:
generation_fn (Callable[[Any], Any]) – The generation function that takes an input and returns an output.
tqdm_bar_format (Optional[str], optional) – The format string for the progress bar. Defaults to None.
max_retries (int, optional) – The maximum number of times to retry on exceptions. Defaults to 10.
exit_on_error (bool, optional) – Whether to exit execution on the first encountered error. Defaults to True.
fallback_return_value (Union[Unset, Any], optional) – The fallback return value for tasks that encounter errors. Defaults to _unset.
- run(inputs: Sequence[Any]) Tuple[List[Any], List[Any]] #
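A sketch of SyncExecutor with a plain Python generation function, assuming the module is importable as phoenix.evals.executors; the doubling function stands in for a call that might raise and be retried.

```python
from phoenix.evals.executors import SyncExecutor  # assumed import path

def generation_fn(x: int) -> int:
    # Stand-in for a call that may raise and be retried up to max_retries times.
    return x * 2

executor = SyncExecutor(generation_fn, max_retries=3, exit_on_error=False)

# run() returns the outputs along with per-input execution details.
outputs, execution_details = executor.run([1, 2, 3])
print(outputs)  # [2, 4, 6]
```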
- class evals.executors.Unset#
Bases:
object
- evals.executors.get_executor_on_sync_context(sync_fn: ~typing.Callable[[~typing.Any], ~typing.Any], async_fn: ~typing.Callable[[~typing.Any], ~typing.Coroutine[~typing.Any, ~typing.Any, ~typing.Any]], run_sync: bool = False, concurrency: int = 3, tqdm_bar_format: str | None = None, max_retries: int = 10, exit_on_error: bool = True, fallback_return_value: ~evals.executors.Unset | ~typing.Any = <evals.executors.Unset object>) Executor #
evals.generate module#
- evals.generate.llm_generate(dataframe: DataFrame, template: PromptTemplate | str, model: BaseModel, system_instruction: str | None = None, verbose: bool = False, output_parser: Callable[[str, int], Dict[str, Any]] | None = None, include_prompt: bool = False, include_response: bool = False, run_sync: bool = False, concurrency: int | None = None) DataFrame #
Generates text from a template using an LLM. This function is useful if you want to generate synthetic data, such as irrelevant responses.
- Parameters:
dataframe (pandas.DataFrame) – A pandas dataframe in which each row represents a record to be used as input to the template. All template variable names must appear as column names in the dataframe (extra columns unrelated to the template are permitted).
template (Union[PromptTemplate, str]) – The prompt template as either an instance of PromptTemplate or a string. If the latter, the variable names should be surrounded by curly braces so that a call to .format can be made to substitute variable values.
model (BaseEvalModel) – An LLM model class.
system_instruction (Optional[str], optional) – An optional system message.
verbose (bool, optional) – If True, prints detailed information to stdout such as model invocation parameters and retry info. Default False.
output_parser (Callable[[str, int], Dict[str, Any]], optional) – An optional function that takes each generated response and response index and parses it to a dictionary. The keys of the dictionary should correspond to the column names of the output dataframe. If None, the output dataframe will have a single column named “output”. Default None.
include_prompt (bool, default=False) – If True, includes a column named prompt in the output dataframe containing the prompt used for each generation.
include_response (bool, default=False) – If True, includes a column named response in the output dataframe containing the raw response from the LLM prior to applying the output parser.
run_sync (bool, default=False) – If True, forces synchronous request submission. Otherwise evaluations will be run asynchronously if possible.
concurrency (Optional[int], default=None) – The number of concurrent evals if async submission is possible. If not provided, a recommended default concurrency is set on a per-model basis.
- Returns:
A dataframe where each row represents the generated output
- Return type:
generations_dataframe (pandas.DataFrame)
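A sketch of llm_generate with an output_parser, assuming the phoenix.evals import path; the prompt wording and the parsed key name are illustrative.

```python
import json

import pandas as pd
from phoenix.evals import OpenAIModel, llm_generate  # assumed import path

df = pd.DataFrame({"topic": ["database indexing", "vector search"]})

template = (
    "Write one plausible but irrelevant answer to a question about {topic}. "
    "Respond as a JSON object with a single key named irrelevant_answer."
)

def output_parser(response: str, response_index: int) -> dict:
    # The returned keys become columns of the output dataframe; fall back to the
    # raw text if the model did not produce valid JSON.
    try:
        return json.loads(response)
    except json.JSONDecodeError:
        return {"irrelevant_answer": response}

generated = llm_generate(
    dataframe=df,
    template=template,
    model=OpenAIModel(model="gpt-4"),
    output_parser=output_parser,
    include_prompt=True,
)
```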
evals.retrievals module#
Helper functions for evaluating the retrieval step of retrieval-augmented generation.
- evals.retrievals.classify_relevance(query: str, document: str, model_name: str) bool | None #
Given a query and a document, determines whether the document contains an answer to the query.
- Parameters:
query (str) – The query text.
document (str) – The document text.
model_name (str) – The name of the OpenAI API model to use for the classification.
- Returns:
- A boolean indicating whether the document contains an answer to the query
(True meaning relevant, False meaning irrelevant), or None if the LLM produces an unparseable output.
- Return type:
Optional[bool]
- evals.retrievals.compute_precisions_at_k(relevance_classifications: List[bool | None]) List[float | None] #
Given a list of relevance classifications, computes precision@k for k = 1, 2, …, n, where n is the length of the input list.
- Parameters:
relevance_classifications (List[Optional[bool]]) – A list of relevance classifications for a set of retrieved documents, sorted by order of retrieval (i.e., the first element is the classification for the first retrieved document, the second element is the classification for the second retrieved document, etc.). The list may contain None values, which indicate that the relevance classification for the corresponding document is unknown.
- Returns:
- A list of precision@k values for k = 1, 2, …, n, where n is the
length of the input list. The first element is the precision@1 value, the second element is the precision@2 value, etc. If the input list contains any None values, those values are omitted when computing the precision@k values.
- Return type:
List[Optional[float]]
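A small worked example: for classifications [True, False, True], precision@1 = 1/1, precision@2 = 1/2, and precision@3 = 2/3. Assuming the phoenix.evals import path:

```python
from phoenix.evals import compute_precisions_at_k  # assumed import path

relevance = [True, False, True]
print(compute_precisions_at_k(relevance))
# Expected: [1.0, 0.5, 0.666...], the fraction of relevant documents among the top k.
```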
evals.span_templates module#
evals.templates module#
- class evals.templates.ClassificationTemplate(rails: List[str], template: str, explanation_template: str | None = None, explanation_label_parser: Callable[[str], str] | None = None, delimiters: Tuple[str, str] = ('{', '}'), scores: List[float] | None = None)#
Bases:
PromptTemplate
- extract_label_from_explanation(raw_string: str) str #
- prompt(options: PromptOptions | None = None) str #
- score(rail: str) float #
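A sketch of constructing a ClassificationTemplate, assuming the phoenix.evals import path; the rails and template text are illustrative, and the mapping of scores to rails by position is an assumption.

```python
from phoenix.evals import ClassificationTemplate  # assumed import path

template = ClassificationTemplate(
    rails=["correct", "incorrect"],
    template=(
        "Question: {input}\nAnswer: {output}\n"
        'Respond with a single word, "correct" or "incorrect".'
    ),
    scores=[1.0, 0.0],  # assumed to map onto the rails by position
)

print(template.prompt())          # the classification prompt text
print(template.score("correct"))  # 1.0 under the positional-scores assumption
```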
- exception evals.templates.InvalidClassificationTemplateError#
Bases:
PhoenixException
- class evals.templates.PromptOptions(provide_explanation: bool = False)#
Bases:
object
- provide_explanation: bool = False#
- class evals.templates.PromptTemplate(template: str, delimiters: Tuple[str, str] = ('{', '}'))#
Bases:
object
- format(variable_values: Mapping[str, bool | int | float | str], options: PromptOptions | None = None) str #
- prompt(options: PromptOptions | None = None) str #
- template: str#
- variables: List[str]#
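A sketch of PromptTemplate.format, assuming the phoenix.evals import path; the template text is illustrative.

```python
from phoenix.evals import PromptTemplate  # assumed import path

template = PromptTemplate("Is the answer {output} supported by the reference {reference}?")
print(template.variables)  # the variable names parsed from the template

prompt = template.format(
    variable_values={"output": "Paris", "reference": "Paris is the capital of France."}
)
print(prompt)
```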
- evals.templates.map_template(dataframe: DataFrame, template: PromptTemplate, options: PromptOptions | None = None) pd.Series[str] #
Maps over a dataframe to construct a list of prompts from a template and a dataframe.
- evals.templates.normalize_classification_template(rails: List[str], template: PromptTemplate | ClassificationTemplate | str) ClassificationTemplate #
Normalizes a template to a ClassificationTemplate object.
- Parameters:
template (Union[ClassificationTemplate, str]) – The template to be normalized.
- Returns:
The normalized template.
- Return type:
ClassificationTemplate
- evals.templates.normalize_prompt_template(template: PromptTemplate | str) PromptTemplate #
Normalizes a template to a PromptTemplate object.
- Parameters:
template (Union[PromptTemplate, str]) – The template to be normalized.
- Returns:
The normalized template.
- Return type:
PromptTemplate
- evals.templates.parse_label_from_chain_of_thought_response(raw_string: str) str #
evals.utils module#
- evals.utils.download_benchmark_dataset(task: str, dataset_name: str) DataFrame #
Downloads an Arize evals benchmark dataset as a pandas dataframe.
- Parameters:
task (str) – Task to be performed.
dataset_name (str) – Name of the dataset.
- Returns:
A pandas dataframe containing the data.
- Return type:
pandas.DataFrame
- evals.utils.get_tqdm_progress_bar_formatter(title: str) str #
Returns a progress bar formatter for use with tqdm.
- Parameters:
title (str) – The title of the progress bar, displayed as a prefix.
- Returns:
A formatter to be passed to the bar_format argument of tqdm.
- Return type:
str
- evals.utils.openai_function_call_kwargs(rails: List[str], provide_explanation: bool) Dict[str, Any] #
Returns keyword arguments needed to invoke an OpenAI model with function calling for classification.
- Parameters:
rails (List[str]) – The rails to snap the output to.
provide_explanation (bool) – Whether to provide an explanation.
- Returns:
A dictionary containing function call arguments.
- Return type:
Dict[str, Any]
- evals.utils.parse_openai_function_call(raw_output: str) Tuple[str, str | None] #
Parses the output of an OpenAI function call.
- Parameters:
raw_output (str) – The raw output of an OpenAI function call.
- Returns:
A tuple of the unrailed label and an optional explanation.
- Return type:
Tuple[str, Optional[str]]
- evals.utils.printif(condition: bool, *args: Any, **kwargs: Any) None #
- evals.utils.snap_to_rail(raw_string: str | None, rails: List[str], verbose: bool = False) str #
Snaps a string to the nearest rail, or returns None if the string cannot be snapped to a rail.
- Parameters:
raw_string (str) – An input to be snapped to a rail.
rails (List[str]) – The target set of strings to snap to.
- Returns:
A string from the rails argument or “UNPARSABLE” if the input string could not be snapped.
- Return type:
str
Module contents#
- class evals.AnthropicModel(default_concurrency: int = 20, _verbose: bool = False, _rate_limiter: phoenix.evals.models.rate_limiters.RateLimiter = <factory>, model: str = 'claude-2.1', temperature: float = 0.0, max_tokens: int = 256, top_p: float = 1, top_k: int = 256, stop_sequences: List[str] = <factory>, extra_parameters: Dict[str, Any] = <factory>, max_content_size: Optional[int] = None)#
Bases:
BaseModel
- extra_parameters: Dict[str, Any]#
Any extra parameters to add to the request body (e.g., countPenalty for a21 models)
- invocation_parameters() Dict[str, Any] #
- max_content_size: int | None = None#
If you’re using a fine-tuned model, set this to the maximum content size
- max_tokens: int = 256#
The maximum number of tokens to generate in the completion.
- model: str = 'claude-2.1'#
The model name to use.
- stop_sequences: List[str]#
If the model encounters a stop sequence, it stops generating further tokens.
- temperature: float = 0.0#
What sampling temperature to use.
- top_k: int = 256#
The cutoff where the model no longer selects the words.
- top_p: float = 1#
Total probability mass of tokens to consider at each step.
- class evals.BedrockModel(default_concurrency: int = 20, _verbose: bool = False, _rate_limiter: phoenix.evals.models.rate_limiters.RateLimiter = <factory>, model_id: str = 'anthropic.claude-v2', temperature: float = 0.0, max_tokens: int = 256, top_p: float = 1, top_k: int = 256, stop_sequences: List[str] = <factory>, session: Any = None, client: Any = None, max_content_size: Optional[int] = None, extra_parameters: Dict[str, Any] = <factory>)#
Bases:
BaseModel
- client: Any = None#
The bedrock session client. If unset, a new one is created with boto3.
- extra_parameters: Dict[str, Any]#
Any extra parameters to add to the request body (e.g., countPenalty for a21 models)
- max_content_size: int | None = None#
If you’re using a fine-tuned model, set this to the maximum content size
- max_tokens: int = 256#
The maximum number of tokens to generate in the completion.
- model_id: str = 'anthropic.claude-v2'#
The model name to use.
- session: Any = None#
A bedrock session. If provided, a new bedrock client will be created using this session.
- stop_sequences: List[str]#
If the model encounters a stop sequence, it stops generating further tokens.
- temperature: float = 0.0#
What sampling temperature to use.
- top_k: int = 256#
The cutoff where the model no longer selects the words
- top_p: float = 1#
Total probability mass of tokens to consider at each step.
- class evals.ClassificationTemplate(rails: List[str], template: str, explanation_template: str | None = None, explanation_label_parser: Callable[[str], str] | None = None, delimiters: Tuple[str, str] = ('{', '}'), scores: List[float] | None = None)#
Bases:
PromptTemplate
- extract_label_from_explanation(raw_string: str) str #
- prompt(options: PromptOptions | None = None) str #
- score(rail: str) float #
- class evals.GeminiModel(default_concurrency: int = 5, _verbose: bool = False, _rate_limiter: phoenix.evals.models.rate_limiters.RateLimiter = <factory>, project: Optional[str] = None, location: Optional[str] = None, credentials: Optional[ForwardRef('Credentials')] = None, model: str = 'gemini-pro', temperature: float = 0.0, max_tokens: int = 256, top_p: float = 1, top_k: int = 32, stop_sequences: List[str] = <factory>)#
Bases:
BaseModel
- credentials: Credentials | None = None#
- default_concurrency: int = 5#
- property generation_config: Dict[str, Any]#
- location: str | None = None#
The default location to use when making API calls.
- max_tokens: int = 256#
The maximum number of tokens to generate in the completion.
- model: str = 'gemini-pro'#
The model name to use.
- project: str | None = None#
The default project to use when making API calls.
- reload_client() None #
- stop_sequences: List[str]#
If the model encounters a stop sequence, it stops generating further tokens.
- temperature: float = 0.0#
What sampling temperature to use.
- top_k: int = 32#
The cutoff where the model no longer selects the words
- top_p: float = 1#
Total probability mass of tokens to consider at each step.
- class evals.HallucinationEvaluator(model: BaseModel)#
Bases:
LLMEvaluator
Leverages an LLM to evaluate whether a response (stored under an “output” column) is a hallucination given a query (stored under an “input” column) and one or more retrieved documents (stored under a “reference” column).
- class evals.LLMEvaluator(model: BaseModel, template: ClassificationTemplate)#
Bases:
object
Leverages an LLM to evaluate individual records.
- async aevaluate(record: Mapping[str, str], provide_explanation: bool = False, use_function_calling_if_available: bool = True, verbose: bool = False) Tuple[str, float | None, str | None] #
Evaluates a single record.
- Parameters:
record (Record) – The record to evaluate.
provide_explanation (bool, optional) – Whether to provide an explanation.
use_function_calling_if_available (bool, optional) – If True, use function calling (if available) as a means to constrain the LLM outputs. With function calling, the LLM is instructed to provide its response as a structured JSON object, which is easier to parse.
verbose (bool, optional) – Whether to print verbose output.
- Returns:
A tuple containing: - label - score (if scores for each label are specified by the template) - explanation (if requested)
- Return type:
Tuple[str, Optional[float], Optional[str]]
- property default_concurrency: int#
- evaluate(record: Mapping[str, str], provide_explanation: bool = False, use_function_calling_if_available: bool = True, verbose: bool = False) Tuple[str, float | None, str | None] #
Evaluates a single record.
- Parameters:
record (Record) – The record to evaluate.
provide_explanation (bool, optional) – Whether to provide an explanation.
use_function_calling_if_available (bool, optional) – If True, use function calling (if available) as a means to constrain the LLM outputs. With function calling, the LLM is instructed to provide its response as a structured JSON object, which is easier to parse.
verbose (bool, optional) – Whether to print verbose output.
- Returns:
A tuple containing: - label - score (if scores for each label are specified by the template) - explanation (if requested)
- Return type:
Tuple[str, Optional[float], Optional[str]]
- reload_client() None #
- class evals.LiteLLMModel(default_concurrency: int = 20, _verbose: bool = False, _rate_limiter: phoenix.evals.models.rate_limiters.RateLimiter = <factory>, model: str = 'gpt-3.5-turbo', temperature: float = 0.0, max_tokens: int = 256, top_p: float = 1, num_retries: int = 0, request_timeout: int = 60, model_kwargs: Dict[str, Any] = <factory>, model_name: Optional[str] = None)#
Bases:
BaseModel
- max_tokens: int = 256#
The maximum number of tokens to generate in the completion.
- model: str = 'gpt-3.5-turbo'#
The model name to use.
- model_kwargs: Dict[str, Any]#
Model specific params
- model_name: str | None = None#
Deprecated since version 3.0.0.
use model instead. This will be removed in a future release.
- num_retries: int = 0#
Maximum number to retry a model if an RateLimitError, OpenAIError, or ServiceUnavailableError occurs.
- request_timeout: int = 60#
Maximum number of seconds to wait when retrying.
- temperature: float = 0.0#
What sampling temperature to use.
- top_p: float = 1#
Total probability mass of tokens to consider at each step.
- class evals.MistralAIModel(*args: Any, **kwargs: Any)#
Bases:
BaseModel
A model class for Mistral AI. Requires the mistralai package to be installed.
- invocation_parameters() Dict[str, Any] #
- model: str = 'mistral-large-latest'#
- random_seed: int | None = None#
- response_format: Dict[str, str] | None = None#
- safe_mode: bool = False#
- safe_prompt: bool = False#
- temperature: float = 0#
- top_p: float | None = None#
- class evals.OpenAIModel(default_concurrency: int = 20, _verbose: bool = False, _rate_limiter: phoenix.evals.models.rate_limiters.RateLimiter = <factory>, api_key: Optional[str] = None, organization: Optional[str] = None, base_url: Optional[str] = None, model: str = 'gpt-4', temperature: float = 0.0, max_tokens: int = 256, top_p: float = 1, frequency_penalty: float = 0, presence_penalty: float = 0, n: int = 1, model_kwargs: Dict[str, Any] = <factory>, batch_size: int = 20, request_timeout: Union[float, Tuple[float, float], NoneType] = None, api_version: Optional[str] = None, azure_endpoint: Optional[str] = None, azure_deployment: Optional[str] = None, azure_ad_token: Optional[str] = None, azure_ad_token_provider: Optional[Callable[[], str]] = None, default_headers: Optional[Mapping[str, str]] = None, model_name: Optional[str] = None)#
Bases:
BaseModel
- api_key: str | None = None#
Your OpenAI key. If not provided, will be read from the environment variable OPENAI_API_KEY.
- api_version: str | None = None#
https://learn.microsoft.com/en-us/azure/ai-services/openai/reference#rest-api-versioning
- azure_ad_token: str | None = None#
- azure_ad_token_provider: Callable[[], str] | None = None#
- azure_deployment: str | None = None#
- azure_endpoint: str | None = None#
The endpoint to use for azure openai. Available in the azure portal. https://learn.microsoft.com/en-us/azure/cognitive-services/openai/how-to/create-resource?pivots=web-portal#create-a-resource
- base_url: str | None = None#
An optional base URL to use for the OpenAI API. If not provided, will default to what’s configured in OpenAI
- batch_size: int = 20#
Batch size to use when passing multiple documents to generate.
- default_headers: Mapping[str, str] | None = None#
Default headers required by AzureOpenAI
- frequency_penalty: float = 0#
Penalizes repeated tokens according to frequency.
- property invocation_params: Dict[str, Any]#
- max_tokens: int = 256#
The maximum number of tokens to generate in the completion. -1 returns as many tokens as possible given the prompt and the model's maximal context size.
- model: str = 'gpt-4'#
Model name to use. In the case of Azure, this is the deployment name, such as gpt-35-instant.
- model_kwargs: Dict[str, Any]#
Holds any model parameters valid for create call not explicitly specified.
- model_name: str | None = None#
Deprecated since version 3.0.0.
use model instead. This will be removed in a future release.
- n: int = 1#
How many completions to generate for each prompt.
- organization: str | None = None#
The organization to use for the OpenAI API. If not provided, will default to what’s configured in OpenAI
- presence_penalty: float = 0#
Penalizes repeated tokens.
- property public_invocation_params: Dict[str, Any]#
- reload_client() None #
- request_timeout: float | Tuple[float, float] | None = None#
Timeout for requests to OpenAI completion API. Default is 600 seconds.
- property supports_function_calling: bool#
- temperature: float = 0.0#
What sampling temperature to use.
- top_p: float = 1#
Total probability mass of tokens to consider at each step.
- verbose_generation_info() str #
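A sketch of configuring OpenAIModel against an Azure OpenAI deployment using the attributes documented above, assuming the phoenix.evals import path; the endpoint, deployment name, and API version are placeholders.

```python
from phoenix.evals import OpenAIModel  # assumed import path

# Per the `model` attribute's docstring, the Azure deployment name is passed as the model;
# all values below are placeholders.
azure_model = OpenAIModel(
    model="gpt-35-turbo-deployment",                         # placeholder deployment name
    azure_endpoint="https://my-resource.openai.azure.com/",  # placeholder endpoint
    api_version="2023-05-15",                                # placeholder API version
    api_key="...",                                           # or azure_ad_token / azure_ad_token_provider
)
```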
- class evals.PromptTemplate(template: str, delimiters: Tuple[str, str] = ('{', '}'))#
Bases:
object
- format(variable_values: Mapping[str, bool | int | float | str], options: PromptOptions | None = None) str #
- prompt(options: PromptOptions | None = None) str #
- template: str#
- variables: List[str]#
- class evals.QAEvaluator(model: BaseModel)#
Bases:
LLMEvaluator
Leverages an LLM to evaluate whether a response (stored under an “output” column) is correct or incorrect given a query (stored under an “input” column) and one or more retrieved documents (stored under a “reference” column).
- class evals.RelevanceEvaluator(model: BaseModel)#
Bases:
LLMEvaluator
Leverages an LLM to evaluate whether a retrieved document (stored under a “reference” column) is relevant or irrelevant to the corresponding query (stored under the “input” column).
- class evals.SQLEvaluator(model: BaseModel)#
Bases:
LLMEvaluator
Leverages an LLM to evaluate whether a generated SQL query (stored under the “query_gen” column) and a response (stored under the “response” column) appropriately answer a question (stored under the “question” column).
- class evals.SummarizationEvaluator(model: BaseModel)#
Bases:
LLMEvaluator
Leverages an LLM to evaluate whether a summary (stored under an “output” column) provides an accurate synopsis of an input document (stored under an “input” column).
- class evals.ToxicityEvaluator(model: BaseModel)#
Bases:
LLMEvaluator
Leverages an LLM to evaluate whether the string stored under the “input” column contains racist, sexist, chauvinistic, biased, or otherwise toxic content.
- class evals.VertexAIModel(default_concurrency: int = 20, _verbose: bool = False, _rate_limiter: phoenix.evals.models.rate_limiters.RateLimiter = <factory>, project: Optional[str] = None, location: Optional[str] = None, credentials: Optional[ForwardRef('Credentials')] = None, model: str = 'text-bison', tuned_model: Optional[str] = None, temperature: float = 0.0, max_tokens: int = 256, top_p: float = 0.95, top_k: int = 40, model_name: Optional[str] = None, tuned_model_name: Optional[str] = None)#
Bases:
BaseModel
- credentials: Credentials | None = None#
- property invocation_params: Dict[str, Any]#
- property is_codey_model: bool#
- location: str | None = None#
The default location to use when making API calls.
- max_tokens: int = 256#
The maximum number of tokens to generate in the completion. -1 returns as many tokens as possible given the prompt and the model's maximal context size.
- model: str = 'text-bison'#
- model_name: str | None = None#
Deprecated since version 3.0.0.
use model instead. This will be removed in a future release.
- project: str | None = None#
The default project to use when making API calls.
- temperature: float = 0.0#
What sampling temperature to use.
- top_k: int = 40#
How the model selects tokens for output: the next token is selected from among the top_k most probable tokens.
- top_p: float = 0.95#
Tokens are selected from most probable to least probable until the sum of their probabilities equals the top_p value.
- tuned_model: str | None = None#
The name of a tuned model. If provided, model is ignored.
- tuned_model_name: str | None = None#
Deprecated since version 3.0.0.
use tuned_model instead. This will be removed in a future release.
- verbose_generation_info() str #
- evals.compute_precisions_at_k(relevance_classifications: List[bool | None]) List[float | None] #
Given a list of relevance classifications, computes precision@k for k = 1, 2, …, n, where n is the length of the input list.
- Parameters:
relevance_classifications (List[Optional[bool]]) – A list of relevance classifications for a set of retrieved documents, sorted by order of retrieval (i.e., the first element is the classification for the first retrieved document, the second element is the classification for the second retrieved document, etc.). The list may contain None values, which indicate that the relevance classification for the corresponding document is unknown.
- Returns:
- A list of precision@k values for k = 1, 2, …, n, where n is the
length of the input list. The first element is the precision@1 value, the second element is the precision@2 value, etc. If the input list contains any None values, those values are omitted when computing the precision@k values.
- Return type:
List[Optional[float]]
- evals.download_benchmark_dataset(task: str, dataset_name: str) DataFrame #
Downloads an Arize evals benchmark dataset as a pandas dataframe.
- Parameters:
task (str) – Task to be performed.
dataset_name (str) – Name of the dataset.
- Returns:
A pandas dataframe containing the data.
- Return type:
pandas.DataFrame
- evals.llm_classify(dataframe: DataFrame, model: BaseModel, template: ClassificationTemplate | PromptTemplate | str, rails: List[str], system_instruction: str | None = None, verbose: bool = False, use_function_calling_if_available: bool = True, provide_explanation: bool = False, include_prompt: bool = False, include_response: bool = False, include_exceptions: bool = False, max_retries: int = 10, exit_on_error: bool = True, run_sync: bool = False, concurrency: int | None = None) DataFrame #
Classifies each input row of the dataframe using an LLM. Returns a pandas.DataFrame where the first column is named label and contains the classification labels. An optional column named explanation is added when provide_explanation=True.
- Parameters:
dataframe (pandas.DataFrame) – A pandas dataframe in which each row represents a record to be classified. All template variable names must appear as column names in the dataframe (extra columns unrelated to the template are permitted).
template (Union[ClassificationTemplate, PromptTemplate, str]) – The prompt template as either an instance of PromptTemplate, ClassificationTemplate, or a string. If a string, the variable names should be surrounded by curly braces so that a call to .format can be made to substitute variable values.
model (BaseEvalModel) – An LLM model class.
rails (List[str]) – A list of strings representing the possible output classes of the model’s predictions.
system_instruction (Optional[str], optional) – An optional system message.
verbose (bool, optional) – If True, prints detailed info to stdout such as model invocation parameters and details about retries and snapping to rails. Default False.
use_function_calling_if_available (bool, default=True) – If True, use function calling (if available) as a means to constrain the LLM outputs. With function calling, the LLM is instructed to provide its response as a structured JSON object, which is easier to parse.
provide_explanation (bool, default=False) – If True, provides an explanation for each classification label. A column named explanation is added to the output dataframe.
include_prompt (bool, default=False) – If True, includes a column named prompt in the output dataframe containing the prompt used for each classification.
include_response (bool, default=False) – If True, includes a column named response in the output dataframe containing the raw response from the LLM.
max_retries (int, optional) – The maximum number of times to retry on exceptions. Defaults to 10.
exit_on_error (bool, default=True) – If True, stops processing evals after all retries are exhausted on a single eval attempt. If False, all evals are attempted before returning, even if some fail.
run_sync (bool, default=False) – If True, forces synchronous request submission. Otherwise evaluations will be run asynchronously if possible.
concurrency (Optional[int], default=None) – The number of concurrent evals if async submission is possible. If not provided, a recommended default concurrency is set on a per-model basis.
- Returns:
A dataframe where the label column (at column position 0) contains the classification labels. If provide_explanation=True, then an additional column named explanation is added to contain the explanation for each label. The dataframe has the same length and index as the input dataframe. The classification label values are from the entries in the rails argument or “NOT_PARSABLE” if the model’s output could not be parsed. The output dataframe also includes three additional columns: exceptions, execution_status, and execution_seconds, containing details about any execution errors that occurred during classification as well as the total runtime of each classification (in seconds).
- Return type:
pandas.DataFrame
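A minimal sketch of a relevance classification run. The model name, template text, and rails below are illustrative placeholders, and the imports are assumed from the module layout documented here:

```python
import pandas as pd
from evals import llm_classify, OpenAIModel  # imports assumed from the module layout above

df = pd.DataFrame(
    {
        "query": ["What is the capital of France?"],
        "reference": ["Paris is the capital and largest city of France."],
    }
)

# Template variables ({query}, {reference}) must match the dataframe's column names.
template = (
    "Is the reference text relevant to the query?\n"
    "Query: {query}\n"
    "Reference: {reference}\n"
    "Answer with a single word: relevant or irrelevant."
)

results = llm_classify(
    dataframe=df,
    model=OpenAIModel(model="gpt-4o-mini"),  # placeholder model name
    template=template,
    rails=["relevant", "irrelevant"],
    provide_explanation=True,  # adds an "explanation" column
)
print(results[["label", "explanation", "execution_status"]])
```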
- evals.llm_generate(dataframe: DataFrame, template: PromptTemplate | str, model: BaseModel, system_instruction: str | None = None, verbose: bool = False, output_parser: Callable[[str, int], Dict[str, Any]] | None = None, include_prompt: bool = False, include_response: bool = False, run_sync: bool = False, concurrency: int | None = None) DataFrame #
Generates text from a template using an LLM. This function is useful if you want to generate synthetic data, such as irrelevant responses.
- Parameters:
dataframe (pandas.DataFrame) – A pandas dataframe in which each row represents a record to be used as input to the template. All template variable names must appear as column names in the dataframe (extra columns unrelated to the template are permitted).
template (Union[PromptTemplate, str]) – The prompt template as either an instance of PromptTemplate or a string. If the latter, the variable names should be surrounded by curly braces so that a call to .format can be made to substitute variable values.
model (BaseEvalModel) – An LLM model class.
system_instruction (Optional[str], optional) – An optional system message.
verbose (bool, optional) – If True, prints detailed information to stdout such as model invocation parameters and retry info. Default False.
output_parser (Callable[[str, int], Dict[str, Any]], optional) – An optional function that takes each generated response and response index and parses it to a dictionary. The keys of the dictionary should correspond to the column names of the output dataframe. If None, the output dataframe will have a single column named “output”. Default None.
include_prompt (bool, default=False) – If True, includes a column named prompt in the output dataframe containing the prompt used for each generation.
include_response (bool, default=False) – If True, includes a column named response in the output dataframe containing the raw response from the LLM prior to applying the output parser.
run_sync (bool, default=False) – If True, forces synchronous request submission. Otherwise evaluations will be run asynchronously if possible.
concurrency (Optional[int], default=None) – The number of concurrent evals if async submission is possible. If not provided, a recommended default concurrency is set on a per-model basis.
- Returns:
A dataframe where each row represents the generated output
- Return type:
generations_dataframe (pandas.DataFrame)
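A minimal sketch of generating synthetic irrelevant responses with an output parser. The model name and template are placeholders, and the imports are assumed from the module layout documented here:

```python
import pandas as pd
from evals import llm_generate, OpenAIModel  # imports assumed from the module layout above

df = pd.DataFrame({"query": ["How do I reset my password?"]})

# Template variable {query} must match a dataframe column name.
template = "Write a short response that is completely unrelated to this question:\n{query}"

def output_parser(response: str, response_index: int) -> dict:
    # Keys become column names in the output dataframe.
    return {"irrelevant_response": response.strip()}

generations = llm_generate(
    dataframe=df,
    template=template,
    model=OpenAIModel(model="gpt-4o-mini"),  # placeholder model name
    output_parser=output_parser,
    include_prompt=True,
)
print(generations.columns.tolist())  # e.g. ["irrelevant_response", "prompt"]
```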
- evals.run_evals(dataframe: DataFrame, evaluators: List[LLMEvaluator], provide_explanation: bool = False, use_function_calling_if_available: bool = True, verbose: bool = False, concurrency: int | None = None) List[DataFrame] #
Applies a list of evaluators to a dataframe. Outputs a list of dataframes in which each dataframe contains the outputs of the corresponding evaluator applied to the input dataframe.
- Parameters:
dataframe (DataFrame) – A pandas dataframe in which each row represents a record to be evaluated. All template variable names must appear as column names in the dataframe (extra columns unrelated to the template are permitted).
evaluators (List[LLMEvaluator]) – A list of evaluators.
provide_explanation (bool, optional) – If True, provides an explanation for each evaluation. A column named "explanation" is added to each output dataframe.
use_function_calling_if_available (bool, optional) – If True, use function calling (if available) as a means to constrain the LLM outputs. With function calling, the LLM is instructed to provide its response as a structured JSON object, which is easier to parse.
verbose (bool, optional) – If True, prints detailed info to stdout such as model invocation parameters and details about retries and snapping to rails.
concurrency (Optional[int], default=None) – The number of concurrent evals if async submission is possible. If not provided, a recommended default concurrency is set on a per-model basis.
- Returns:
A list of dataframes, one for each evaluator, all of which have the same number of rows as the input dataframe.
- Return type:
List[DataFrame]
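A minimal sketch of running a single evaluator over a dataframe. The LLMEvaluator construction shown here (a model plus a template) is an assumption about its signature rather than something documented above, and the model name and template text are placeholders:

```python
import pandas as pd
from evals import run_evals, OpenAIModel, LLMEvaluator  # imports assumed from the module layout above

df = pd.DataFrame(
    {
        "input": ["What is the capital of France?"],
        "output": ["The capital of France is Berlin."],
    }
)

# Hypothetical evaluator: the constructor arguments below are assumptions.
correctness_evaluator = LLMEvaluator(
    model=OpenAIModel(model="gpt-4o-mini"),  # placeholder model name
    template=(
        "Is the answer factually correct for the question?\n"
        "Question: {input}\nAnswer: {output}\n"
        "Respond with a single word: correct or incorrect."
    ),
)

# One output dataframe per evaluator, each aligned with the input dataframe.
[correctness_df] = run_evals(
    dataframe=df,
    evaluators=[correctness_evaluator],
    provide_explanation=True,
)
```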