evals.evaluators#

class HallucinationEvaluator(model)#

Bases: LLMEvaluator

Leverages an LLM to evaluate whether a response (stored under an “output” column) is a hallucination given a query (stored under an “input” column) and one or more retrieved documents (stored under a “reference” column).
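A minimal batch-usage sketch. It assumes HallucinationEvaluator, OpenAIModel, and the run_evals helper are importable from phoenix.evals (as in recent Arize Phoenix releases); the model name and the data are illustrative:

    import pandas as pd

    from phoenix.evals import HallucinationEvaluator, OpenAIModel, run_evals

    # Column layout expected by the evaluator: "input" (query), "output" (response),
    # "reference" (retrieved documents).
    df = pd.DataFrame(
        {
            "input": ["Where is the Eiffel Tower?"],
            "output": ["The Eiffel Tower is in Berlin."],
            "reference": ["The Eiffel Tower is a landmark in Paris, France."],
        }
    )

    # The model wrapper and model name are illustrative; any phoenix.evals model wrapper should work.
    evaluator = HallucinationEvaluator(OpenAIModel(model="gpt-4o-mini"))

    # run_evals returns one results dataframe per evaluator, row-aligned with df.
    [hallucination_df] = run_evals(
        dataframe=df,
        evaluators=[evaluator],
        provide_explanation=True,
    )
    print(hallucination_df)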

class LLMEvaluator(model, template)#

Bases: object

Leverages an LLM to evaluate individual records.

async aevaluate(record, provide_explanation=False, use_function_calling_if_available=True, verbose=False)#

Asynchronously evaluates a single record.

Parameters:
  • record (Record) – The record to evaluate.

  • provide_explanation (bool, optional) – Whether to provide an explanation.

  • use_function_calling_if_available (bool, optional) – If True, use function calling (if available) as a means to constrain the LLM outputs. With function calling, the LLM is instructed to provide its response as a structured JSON object, which is easier to parse.

  • verbose (bool, optional) – Whether to print verbose output.

Returns:

A tuple containing:
  • label

  • score (if scores for each label are specified by the template)

  • explanation (if requested)

Return type:

Tuple[str, Optional[float], Optional[str]]
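A minimal async sketch. aevaluate is inherited by the concrete evaluators below; the record is assumed to be a plain mapping from the template's column names to strings, and the model wrapper and model name are illustrative:

    import asyncio

    from phoenix.evals import HallucinationEvaluator, OpenAIModel

    evaluator = HallucinationEvaluator(OpenAIModel(model="gpt-4o-mini"))

    record = {
        "input": "Where is the Eiffel Tower?",
        "output": "The Eiffel Tower is in Paris.",
        "reference": "The Eiffel Tower is a landmark in Paris, France.",
    }

    async def main() -> None:
        # Returns (label, score, explanation) per the documented return type.
        label, score, explanation = await evaluator.aevaluate(record, provide_explanation=True)
        print(label, score, explanation)

    asyncio.run(main())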

evaluate(record, provide_explanation=False, use_function_calling_if_available=True, verbose=False)#

Evaluates a single record.

Parameters:
  • record (Record) – The record to evaluate.

  • provide_explanation (bool, optional) – Whether to provide an explanation.

  • use_function_calling_if_available (bool, optional) – If True, use function calling (if available) as a means to constrain the LLM outputs. With function calling, the LLM is instructed to provide its response as a structured JSON object, which is easier to parse.

  • verbose (bool, optional) – Whether to print verbose output.

Returns:

A tuple containing:
  • label

  • score (if scores for each label are specified by the template)

  • explanation (if requested)

Return type:

Tuple[str, Optional[float], Optional[str]]
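A sketch of the synchronous path with a custom evaluator built directly from LLMEvaluator. The ClassificationTemplate import path and its rails/template/scores constructor arguments are assumptions about the templates module, not part of this reference:

    from phoenix.evals import LLMEvaluator, OpenAIModel
    from phoenix.evals.templates import ClassificationTemplate  # import path is an assumption

    # The variables in braces become the column names expected in each record.
    template = ClassificationTemplate(
        rails=["polite", "impolite"],  # labels the LLM may return
        template=(
            "Classify the tone of the following text as polite or impolite.\n"
            "Text: {input}\n"
            "Respond with a single word: polite or impolite."
        ),
        scores=[1.0, 0.0],  # optional numeric score for each rail
    )

    evaluator = LLMEvaluator(OpenAIModel(model="gpt-4o-mini"), template)
    label, score, explanation = evaluator.evaluate(
        {"input": "Thanks so much for your help!"}, provide_explanation=True
    )
    print(label, score, explanation)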

class QAEvaluator(model)#

Bases: LLMEvaluator

Leverages an LLM to evaluate whether a response (stored under an “output” column) is correct or incorrect given a query (stored under an “input” column) and one or more retrieved documents (stored under a “reference” column).
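A single-record sketch; the labels follow the correct/incorrect framing described above, and the model name and data are illustrative:

    from phoenix.evals import OpenAIModel, QAEvaluator

    qa_evaluator = QAEvaluator(OpenAIModel(model="gpt-4o-mini"))
    label, score, explanation = qa_evaluator.evaluate(
        {
            "input": "Who wrote Pride and Prejudice?",
            "output": "Pride and Prejudice was written by Jane Austen.",
            "reference": "Pride and Prejudice is an 1813 novel by Jane Austen.",
        },
        provide_explanation=True,
    )
    print(label, explanation)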

class RelevanceEvaluator(model)#

Bases: LLMEvaluator

Leverages an LLM to evaluate whether a retrieved document (stored under a “reference” column) is relevant or irrelevant to the corresponding query (stored under the “input” column).
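A batch sketch over retrievals, with one row per (query, retrieved document) pair; run_evals is assumed to be available from phoenix.evals as in the earlier example:

    import pandas as pd

    from phoenix.evals import OpenAIModel, RelevanceEvaluator, run_evals

    retrievals_df = pd.DataFrame(
        {
            "input": ["How tall is Mount Everest?"] * 2,
            "reference": [
                "Mount Everest's summit is 8,849 meters above sea level.",
                "The Nile is the longest river in Africa.",
            ],
        }
    )

    [relevance_df] = run_evals(
        dataframe=retrievals_df,
        evaluators=[RelevanceEvaluator(OpenAIModel(model="gpt-4o-mini"))],
    )
    print(relevance_df)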

class SQLEvaluator(model)#

Bases: LLMEvaluator

Leverages an LLM to evaluate whether a generated SQL query (stored under the “query_gen” column) and a response (stored under the “response” column) appropriately answer a question (stored under the “question” column).
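A single-record sketch using the documented "question", "query_gen", and "response" columns; the data and model name are illustrative:

    from phoenix.evals import OpenAIModel, SQLEvaluator

    sql_evaluator = SQLEvaluator(OpenAIModel(model="gpt-4o-mini"))
    label, score, explanation = sql_evaluator.evaluate(
        {
            "question": "How many customers signed up in 2023?",
            "query_gen": "SELECT COUNT(*) FROM customers WHERE signup_year = 2023;",
            "response": "1,204 customers signed up in 2023.",
        },
        provide_explanation=True,
    )
    print(label, explanation)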

class SummarizationEvaluator(model)#

Bases: LLMEvaluator

Leverages an LLM to evaluate whether a summary (stored under an “output” column) provides an accurate synopsis of an input document (stored under an “input” column).
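A single-record sketch pairing a document ("input") with its candidate summary ("output"); the text and model name are illustrative:

    from phoenix.evals import OpenAIModel, SummarizationEvaluator

    summarization_evaluator = SummarizationEvaluator(OpenAIModel(model="gpt-4o-mini"))
    label, score, explanation = summarization_evaluator.evaluate(
        {
            "input": "The full text of the document to be summarized ...",
            "output": "A one-sentence summary of the document.",
        }
    )
    print(label, score)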

class ToxicityEvaluator(model)#

Bases: LLMEvaluator

Leverages an LLM to evaluate whether the string stored under the “input” column contains racist, sexist, chauvinistic, biased, or otherwise toxic content.
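A single-record sketch; only an "input" column is needed, and the exact labels returned depend on the built-in toxicity template. The model name and data are illustrative:

    from phoenix.evals import OpenAIModel, ToxicityEvaluator

    toxicity_evaluator = ToxicityEvaluator(OpenAIModel(model="gpt-4o-mini"))
    label, _score, explanation = toxicity_evaluator.evaluate(
        {"input": "You are all wonderful and I appreciate your help."},
        provide_explanation=True,
    )
    print(label, explanation)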