evals.evaluators#

class HallucinationEvaluator(model: BaseModel)#

Bases: LLMEvaluator

Leverages an LLM to evaluate whether a response (stored under an “output” column) is a hallucination given a query (stored under an “input” column) and one or more retrieved documents (stored under a “reference” column).
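
A minimal usage sketch (the OpenAIModel wrapper and the model name are illustrative assumptions, not requirements of this class):

  from phoenix.evals import HallucinationEvaluator, OpenAIModel

  # Any phoenix.evals model wrapper can be passed; OpenAIModel and the
  # model name below are illustrative assumptions.
  model = OpenAIModel(model="gpt-4o-mini")
  evaluator = HallucinationEvaluator(model)

  # The record keys follow the columns described above.
  label, score, explanation = evaluator.evaluate(
      {
          "input": "What is the capital of France?",
          "output": "The capital of France is Berlin.",
          "reference": "Paris is the capital and largest city of France.",
      },
      provide_explanation=True,
  )
  print(label, score, explanation)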

class LLMEvaluator(model: BaseModel, template: ClassificationTemplate)#

Bases: object

Leverages an LLM to evaluate individual records.
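
A construction sketch with a custom template, assuming ClassificationTemplate accepts rails and template keyword arguments (treat those keyword names and the model choice as assumptions):

  from phoenix.evals import ClassificationTemplate, LLMEvaluator, OpenAIModel

  # Assumes ClassificationTemplate takes `rails` (allowed labels) and
  # `template` (a prompt with {variable} placeholders).
  template = ClassificationTemplate(
      rails=["polite", "impolite"],
      template=(
          "Classify the tone of the text below as polite or impolite.\n"
          "Text: {input}\n"
      ),
  )
  evaluator = LLMEvaluator(
      model=OpenAIModel(model="gpt-4o-mini"),  # illustrative model choice
      template=template,
  )
  label, score, explanation = evaluator.evaluate(
      {"input": "Thanks so much for your help!"}
  )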

async aevaluate(record: Mapping[str, str], provide_explanation: bool = False, use_function_calling_if_available: bool = True, verbose: bool = False) → Tuple[str, float | None, str | None]#

Evaluates a single record.

Parameters:
  • record (Record) – The record to evaluate.

  • provide_explanation (bool, optional) – Whether to provide an explanation.

  • use_function_calling_if_available (bool, optional) – If True, use function calling (if available) to generate structured outputs. With function calling, the LLM is instructed to provide its response as a structured JSON object, which is easier to parse.

  • verbose (bool, optional) – Whether to print verbose output.

Returns:

A tuple containing:

  • label

  • score (if scores for each label are specified by the template)

  • explanation (if requested)

Return type:

Tuple[str, Optional[float], Optional[str]]
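
A sketch of asynchronous usage with one of the built-in evaluators (the model wrapper, model name, and record contents are illustrative):

  import asyncio

  from phoenix.evals import HallucinationEvaluator, OpenAIModel

  async def main() -> None:
      # Illustrative model wrapper and model name.
      evaluator = HallucinationEvaluator(OpenAIModel(model="gpt-4o-mini"))
      label, score, explanation = await evaluator.aevaluate(
          {
              "input": "When was the Eiffel Tower completed?",
              "output": "It was completed in 1889.",
              "reference": "The Eiffel Tower was completed in 1889.",
          },
          provide_explanation=True,
      )
      print(label, score, explanation)

  asyncio.run(main())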

property default_concurrency: int#
evaluate(record: Mapping[str, str], provide_explanation: bool = False, use_function_calling_if_available: bool = True, verbose: bool = False) → Tuple[str, float | None, str | None]#

Evaluates a single record.

Parameters:
  • record (Record) – The record to evaluate.

  • provide_explanation (bool, optional) – Whether to provide an explanation.

  • use_function_calling_if_available (bool, optional) – If True, use function calling (if available) to generate structured outputs. With function calling, the LLM is instructed to provide its response as a structured JSON object, which is easier to parse.

  • verbose (bool, optional) – Whether to print verbose output.

Returns:

A tuple containing:

  • label

  • score (if scores for each label are specified by the template)

  • explanation (if requested)

Return type:

Tuple[str, Optional[float], Optional[str]]

reload_client() → None#
class MapReducer(model: BaseModel, map_prompt_template: PromptTemplate, reduce_prompt_template: PromptTemplate)#

Bases: object

Evaluates data that is too large to fit into a single context window using a map-reduce strategy. The data must first be divided into “chunks” that individually fit into an LLM’s context window. Each chunk of data is individually evaluated (the “map” step), producing intermediate outputs that are combined into a single result (the “reduce” step).

This is the simplest strategy for evaluating long-context data.

evaluate(chunks: List[str]) → str#

Evaluates a list of two or more chunks.

Parameters:
  • chunks (List[str]) – A list of chunks to be evaluated. Each chunk is inserted into the map_prompt_template and must therefore fit within the LLM's context window and still leave room for the rest of the prompt.

Returns:

The output of the map-reduce process.

Return type:

str
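
A construction sketch, assuming MapReducer and PromptTemplate are importable from phoenix.evals and that the {chunk} and {mapped_results} placeholders match the variable names the templates are formatted with (the prompt wording, placeholder names, and model choice are all assumptions):

  from phoenix.evals import MapReducer, OpenAIModel, PromptTemplate

  # The prompt wording and the {chunk}/{mapped_results} variable names are
  # assumptions, not part of the documented API.
  map_template = PromptTemplate(
      "Summarize the key complaints in this chunk of a support transcript:\n{chunk}"
  )
  reduce_template = PromptTemplate(
      "Combine these chunk-level summaries into one list of complaints:\n{mapped_results}"
  )
  mapper = MapReducer(
      model=OpenAIModel(model="gpt-4o-mini"),  # illustrative model choice
      map_prompt_template=map_template,
      reduce_prompt_template=reduce_template,
  )
  result = mapper.evaluate(["first chunk of a long transcript...", "second chunk..."])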

class QAEvaluator(model: BaseModel)#

Bases: LLMEvaluator

Leverages an LLM to evaluate whether a response (stored under an “output” column) is correct or incorrect given a query (stored under an “input” column) and one or more retrieved documents (stored under a “reference” column).
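
A minimal sketch using the "input", "output", and "reference" keys described above (the model wrapper and model name are illustrative):

  from phoenix.evals import OpenAIModel, QAEvaluator

  evaluator = QAEvaluator(OpenAIModel(model="gpt-4o-mini"))  # illustrative model
  label, score, explanation = evaluator.evaluate(
      {
          "input": "Who wrote Pride and Prejudice?",
          "output": "Pride and Prejudice was written by Jane Austen.",
          "reference": "Pride and Prejudice is an 1813 novel by Jane Austen.",
      },
      provide_explanation=True,
  )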

class Refiner(model: BaseModel, initial_prompt_template: PromptTemplate, refine_prompt_template: PromptTemplate, synthesize_prompt_template: PromptTemplate | None = None)#

Bases: object

Evaluates data that is too large to fit into a single context window using a refine strategy. The data must first be divided into “chunks” that individually fit into an LLM’s context window. An initial “accumulator” is generated from the first chunk of data. The accumulator is subsequently refined by iteratively updating and incorporating new information from each subsequent chunk. An optional synthesis step can be used to synthesize the final accumulator into a desired format.

evaluate(chunks: List[str]) → str#

Evaluates a list of two or more chunks.

Parameters:
  • chunks (List[str]) – A list of chunks to be evaluated. Each chunk is inserted into the initial_prompt_template and refine_prompt_template and must therefore fit within the LLM's context window and still leave room for the rest of the prompt.

Returns:

The output of the refine process.

Return type:

str
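
A construction sketch mirroring the MapReducer example above; the import path, the {chunk} and {accumulator} placeholder names, the prompt wording, and the model choice are assumptions:

  from phoenix.evals import OpenAIModel, PromptTemplate, Refiner

  # Prompt wording and the {chunk}/{accumulator} variable names are assumptions.
  initial_template = PromptTemplate(
      "List the action items in this meeting transcript chunk:\n{chunk}"
  )
  refine_template = PromptTemplate(
      "Action items collected so far:\n{accumulator}\n\n"
      "Update the list with any new action items from this chunk:\n{chunk}"
  )
  synthesize_template = PromptTemplate(
      "Rewrite the final action items as a numbered list:\n{accumulator}"
  )
  refiner = Refiner(
      model=OpenAIModel(model="gpt-4o-mini"),  # illustrative model choice
      initial_prompt_template=initial_template,
      refine_prompt_template=refine_template,
      synthesize_prompt_template=synthesize_template,
  )
  result = refiner.evaluate(["first chunk of a long transcript...", "second chunk..."])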

class RelevanceEvaluator(model: BaseModel)#

Bases: LLMEvaluator

Leverages an LLM to evaluate whether a retrieved document (stored under a “reference” column) is relevant or irrelevant to the corresponding query (stored under the “input” column).
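
A minimal sketch using the "input" and "reference" keys described above (model wrapper and model name are illustrative):

  from phoenix.evals import OpenAIModel, RelevanceEvaluator

  evaluator = RelevanceEvaluator(OpenAIModel(model="gpt-4o-mini"))
  label, score, explanation = evaluator.evaluate(
      {
          "input": "How do I reset my password?",
          "reference": "To reset your password, click 'Forgot password' on the sign-in page.",
      }
  )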

class SQLEvaluator(model: BaseModel)#

Bases: LLMEvaluator

Leverages an LLM to evaluate whether a generated SQL query (stored under the “query_gen” column) and a response (stored under the “response” column) appropriately answer a question (stored under the “question” column).
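
A minimal sketch using the "question", "query_gen", and "response" keys described above (model wrapper, model name, and record contents are illustrative):

  from phoenix.evals import OpenAIModel, SQLEvaluator

  evaluator = SQLEvaluator(OpenAIModel(model="gpt-4o-mini"))
  label, score, explanation = evaluator.evaluate(
      {
          "question": "How many orders were placed in 2023?",
          "query_gen": (
              "SELECT COUNT(*) FROM orders "
              "WHERE order_date >= '2023-01-01' AND order_date < '2024-01-01';"
          ),
          "response": "There were 1,204 orders placed in 2023.",
      }
  )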

class SummarizationEvaluator(model: BaseModel)#

Bases: LLMEvaluator

Leverages an LLM to evaluate whether a summary (stored under an “output” column) provides an accurate synopsis of an input document (stored under an “input” column).
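
A minimal sketch using the "input" and "output" keys described above (model wrapper and model name are illustrative):

  from phoenix.evals import OpenAIModel, SummarizationEvaluator

  evaluator = SummarizationEvaluator(OpenAIModel(model="gpt-4o-mini"))
  label, score, explanation = evaluator.evaluate(
      {
          "input": "Full text of the article being summarized...",
          "output": "A one-paragraph synopsis of the article.",
      }
  )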

class ToxicityEvaluator(model: BaseModel)#

Bases: LLMEvaluator

Leverages an LLM to evaluate whether the string stored under the “input” column contains racist, sexist, chauvinistic, biased, or otherwise toxic content.
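
A minimal sketch using the "input" key described above (model wrapper and model name are illustrative):

  from phoenix.evals import OpenAIModel, ToxicityEvaluator

  evaluator = ToxicityEvaluator(OpenAIModel(model="gpt-4o-mini"))
  label, score, explanation = evaluator.evaluate(
      {"input": "You did a great job on this report."}
  )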