evals.evaluators#
- class HallucinationEvaluator(model: BaseModel)#
Bases:
LLMEvaluator
Leverages an LLM to evaluate whether a response (stored under an “output” column) is a hallucination given a query (stored under an “input” column) and one or more retrieved documents (stored under a “reference” column).
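A minimal usage sketch follows. It assumes OpenAIModel and the run_evals helper are importable from phoenix.evals as in recent releases; the model name and constructor keyword arguments may differ by version.

import pandas as pd
from phoenix.evals import HallucinationEvaluator, OpenAIModel, run_evals

# Column names must match what the evaluator expects:
# "input" (query), "output" (response), "reference" (retrieved documents).
df = pd.DataFrame(
    {
        "input": ["Who wrote Hamlet?"],
        "output": ["Hamlet was written by Charles Dickens."],
        "reference": ["Hamlet is a tragedy written by William Shakespeare."],
    }
)

model = OpenAIModel(model="gpt-4o-mini")  # any BaseModel implementation works
evaluator = HallucinationEvaluator(model)

# run_evals returns one dataframe of results per evaluator.
[results] = run_evals(dataframe=df, evaluators=[evaluator], provide_explanation=True)
print(results)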
- class LLMEvaluator(model: BaseModel, template: ClassificationTemplate)#
Bases:
object
Leverages an LLM to evaluate individual records.
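A hedged sketch of building a custom evaluator. The ClassificationTemplate keyword arguments (rails, template) and the import paths are assumptions based on the signature above; adjust them to your installed version.

from phoenix.evals import ClassificationTemplate, OpenAIModel
from phoenix.evals.evaluators import LLMEvaluator

# A toy binary-classification template; {input} and {output} name the record keys.
template = ClassificationTemplate(
    rails=["polite", "impolite"],
    template=(
        "Classify the tone of the response as polite or impolite.\n"
        "Question: {input}\n"
        "Response: {output}"
    ),
)

evaluator = LLMEvaluator(model=OpenAIModel(model="gpt-4o-mini"), template=template)
label, score, explanation = evaluator.evaluate(
    record={"input": "Where is my order?", "output": "Stop asking."}
)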
- async aevaluate(record: Mapping[str, str], provide_explanation: bool = False, use_function_calling_if_available: bool = True, verbose: bool = False) → Tuple[str, float | None, str | None] #
Evaluates a single record.
- Parameters:
record (Record) – The record to evaluate.
provide_explanation (bool, optional) – Whether to provide an explanation.
use_function_calling_if_available (bool, optional) – If True, use function calling (if available) to constrain the LLM outputs. With function calling, the LLM is instructed to provide its response as a structured JSON object, which is easier to parse.
verbose (bool, optional) – Whether to print verbose output.
- Returns:
A tuple containing:
- label
- score (if scores for each label are specified by the template)
- explanation (if requested)
- Return type:
Tuple[str, Optional[float], Optional[str]]
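A sketch of evaluating several records concurrently with aevaluate, assuming the same phoenix.evals imports as in the earlier sketches:

import asyncio

from phoenix.evals import HallucinationEvaluator, OpenAIModel

evaluator = HallucinationEvaluator(OpenAIModel(model="gpt-4o-mini"))
records = [
    {"input": "Who wrote Hamlet?", "output": "Dickens.", "reference": "Hamlet is by Shakespeare."},
    {"input": "What is the capital of France?", "output": "Paris.", "reference": "Paris is the capital of France."},
]

async def main():
    # Each call resolves to a (label, score, explanation) tuple.
    return await asyncio.gather(
        *(evaluator.aevaluate(record, provide_explanation=True) for record in records)
    )

results = asyncio.run(main())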
- property default_concurrency: int#
- evaluate(record: Mapping[str, str], provide_explanation: bool = False, use_function_calling_if_available: bool = True, verbose: bool = False) → Tuple[str, float | None, str | None] #
Evaluates a single record.
- Parameters:
record (Record) – The record to evaluate.
provide_explanation (bool, optional) – Whether to provide an explanation.
use_function_calling_if_available (bool, optional) – If True, use function calling (if available) to constrain the LLM outputs. With function calling, the LLM is instructed to provide its response as a structured JSON object, which is easier to parse.
verbose (bool, optional) – Whether to print verbose output.
- Returns:
A tuple containing:
- label
- score (if scores for each label are specified by the template)
- explanation (if requested)
- Return type:
Tuple[str, Optional[float], Optional[str]]
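For clarity, a brief sketch of the return values, assuming the polite/impolite evaluator constructed in the LLMEvaluator sketch above:

label, score, explanation = evaluator.evaluate(
    record={"input": "Where is my order?", "output": "Stop asking."},
    provide_explanation=False,
)
# `score` is None unless the template defines per-label scores;
# `explanation` is None because provide_explanation was False.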
- reload_client() → None #
- class MapReducer(model: BaseModel, map_prompt_template: PromptTemplate, reduce_prompt_template: PromptTemplate)#
Bases:
object
Evaluates data that is too large to fit into a single context window using a map-reduce strategy. The data must first be divided into “chunks” that individually fit into an LLM’s context window. Each chunk of data is individually evaluated (the “map” step), producing intermediate outputs that are combined into a single result (the “reduce” step).
This is the simplest strategy for evaluating long-context data.
- evaluate(chunks: List[str]) → str #
Evaluates a list of two or more chunks.
- Parameters:
chunks (List[str]) – A list of chunks to be evaluated. Each chunk is inserted into the map_prompt_template and must therefore fit within the LLM's context window and still leave room for the rest of the prompt.
- Returns:
The output of the map-reduce process.
- Return type:
str
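A hedged sketch of the map-reduce flow. The MapReducer import path, the assumption that PromptTemplate takes the template text as its first argument, and the template variable names ({chunk}, {mapped_records}) are all assumptions that must match your installed version.

from phoenix.evals import OpenAIModel, PromptTemplate
from phoenix.evals.evaluators import MapReducer

# NOTE: the variable names "chunk" and "mapped_records" are assumptions; they
# must match whatever MapReducer substitutes into each template.
map_template = PromptTemplate(
    "Summarize the main complaints in this conversation excerpt:\n{chunk}"
)
reduce_template = PromptTemplate(
    "Combine these partial summaries into one overall summary:\n{mapped_records}"
)

mapreducer = MapReducer(
    model=OpenAIModel(model="gpt-4o-mini"),
    map_prompt_template=map_template,
    reduce_prompt_template=reduce_template,
)
# Each chunk must fit in the model's context window alongside the template.
summary = mapreducer.evaluate(["chunk one of a long transcript...", "chunk two..."])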
- class QAEvaluator(model: BaseModel)#
Bases:
LLMEvaluator
Leverages an LLM to evaluate whether a response (stored under an “output” column) is correct or incorrect given a query (stored under an “input” column) and one or more retrieved documents (stored under a “reference” column).
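A brief sketch of direct single-record evaluation; the record keys mirror the column names above, and the imports are assumed as in the earlier sketches.

from phoenix.evals import OpenAIModel, QAEvaluator

qa_evaluator = QAEvaluator(OpenAIModel(model="gpt-4o-mini"))
label, score, explanation = qa_evaluator.evaluate(
    record={
        "input": "What year did Apollo 11 land on the moon?",
        "output": "1969",
        "reference": "Apollo 11 landed on the moon on July 20, 1969.",
    },
    provide_explanation=True,
)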
- class Refiner(model: BaseModel, initial_prompt_template: PromptTemplate, refine_prompt_template: PromptTemplate, synthesize_prompt_template: PromptTemplate | None = None)#
Bases:
object
Evaluates data that is too large to fit into a single context window using a refine strategy. The data must first be divided into “chunks” that individually fit into an LLM’s context window. An initial “accumulator” is generated from the first chunk of data. The accumulator is subsequently refined by iteratively updating and incorporating new information from each subsequent chunk. An optional synthesis step can be used to synthesize the final accumulator into a desired format.
- evaluate(chunks: List[str]) → str #
Evaluates a list of two or more chunks.
- Parameters:
chunks (List[str]) – A list of chunks to be evaluated. Each chunk is inserted into the initial_prompt_template and refine_prompt_template and must therefore fit within the LLM's context window and still leave room for the rest of the prompt.
- Returns:
The output of the refine process.
- Return type:
str
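A hedged sketch of the refine flow. As with MapReducer, the import path and the template variable names ({chunk}, {accumulator}) are assumptions to be checked against your installed version.

from phoenix.evals import OpenAIModel, PromptTemplate
from phoenix.evals.evaluators import Refiner

# NOTE: "chunk" and "accumulator" are assumed variable names; they must match
# what Refiner substitutes into each template.
initial_template = PromptTemplate(
    "List any factual errors in this document excerpt:\n{chunk}"
)
refine_template = PromptTemplate(
    "Current findings:\n{accumulator}\n\nUpdate the findings using this excerpt:\n{chunk}"
)

refiner = Refiner(
    model=OpenAIModel(model="gpt-4o-mini"),
    initial_prompt_template=initial_template,
    refine_prompt_template=refine_template,
    synthesize_prompt_template=None,  # optional final formatting step
)
report = refiner.evaluate(["first chunk of a long document...", "second chunk..."])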
- class RelevanceEvaluator(model: BaseModel)#
Bases:
LLMEvaluator
Leverages an LLM to evaluate whether a retrieved document (stored under a “reference” column) is relevant or irrelevant to the corresponding query (stored under the “input” column).
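A brief sketch, with imports assumed as in the earlier sketches; only the "input" and "reference" keys are required.

from phoenix.evals import OpenAIModel, RelevanceEvaluator

relevance_evaluator = RelevanceEvaluator(OpenAIModel(model="gpt-4o-mini"))
label, _, _ = relevance_evaluator.evaluate(
    record={
        "input": "How do I reset my password?",
        "reference": "To reset your password, open Settings and choose 'Reset password'.",
    }
)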
- class SQLEvaluator(model: BaseModel)#
Bases:
LLMEvaluator
Leverages an LLM to evaluate whether a generated SQL query (stored under the “query_gen” column) and a response (stored under the “response” column) appropriately answer a question (stored under the “question” column).
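A brief sketch; note the different record keys ("question", "query_gen", "response"). Import locations are assumed as in the earlier sketches.

from phoenix.evals import OpenAIModel, SQLEvaluator

sql_evaluator = SQLEvaluator(OpenAIModel(model="gpt-4o-mini"))
label, _, explanation = sql_evaluator.evaluate(
    record={
        "question": "How many customers signed up in 2023?",
        "query_gen": "SELECT COUNT(*) FROM customers WHERE signup_year = 2023;",
        "response": "1,204 customers signed up in 2023.",
    },
    provide_explanation=True,
)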
- class SummarizationEvaluator(model: BaseModel)#
Bases:
LLMEvaluator
Leverages an LLM to evaluate whether a summary (stored under an “output” column) provides an accurate synopsis of an input document (stored under an “input” column).
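A brief sketch; the record carries the document under "input" and the summary under "output", with imports assumed as above.

from phoenix.evals import OpenAIModel, SummarizationEvaluator

summarization_evaluator = SummarizationEvaluator(OpenAIModel(model="gpt-4o-mini"))
label, _, _ = summarization_evaluator.evaluate(
    record={
        "input": "Full text of the article to be summarized...",
        "output": "A one-paragraph summary of the article.",
    }
)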
- class ToxicityEvaluator(model: BaseModel)#
Bases:
LLMEvaluator
Leverages an LLM to evaluate whether the string stored under the “input” column contains racist, sexist, chauvinistic, biased, or otherwise toxic content.
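A brief sketch; only the "input" key is required, and the imports are assumed as in the earlier sketches.

from phoenix.evals import OpenAIModel, ToxicityEvaluator

toxicity_evaluator = ToxicityEvaluator(OpenAIModel(model="gpt-4o-mini"))
label, _, _ = toxicity_evaluator.evaluate(
    record={"input": "You are all idiots and deserve nothing."}
)
print(label)  # e.g. a label such as "toxic" or "non-toxic", per the template's rails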