
Metrics

MetricType

Bases: Enum

Enumeration of metric types in Ragas.

Attributes:

Name Type Description
SINGLE_TURN str

Represents a single-turn metric type.

MULTI_TURN str

Represents a multi-turn metric type.

Metric dataclass

Metric(
    _required_columns: Dict[MetricType, Set[str]] = dict()
)

Bases: ABC

Abstract base class for metrics in Ragas.

Attributes:

Name Type Description
name str

The name of the metric.

required_columns Dict[str, Set[str]]

A dictionary mapping metric type names to sets of required column names. This is a property and raises ValueError if columns are not in VALID_COLUMNS.

score

score(row: Dict, callbacks: Callbacks = None) -> float

Calculates the score for a single row of data.

Note

This method is deprecated and will be removed in 0.3. Please use single_turn_ascore or multi_turn_ascore instead.

Source code in src/ragas/metrics/base.py
@deprecated("0.2", removal="0.3", alternative="single_turn_ascore")
def score(self: t.Self, row: t.Dict, callbacks: Callbacks = None) -> float:
    """
    Calculates the score for a single row of data.

    Note
    ----
    This method is deprecated and will be removed in 0.3. Please use `single_turn_ascore` or `multi_turn_ascore` instead.
    """
    callbacks = callbacks or []
    rm, group_cm = new_group(self.name, inputs=row, callbacks=callbacks)
    try:
        if is_event_loop_running():
            try:
                import nest_asyncio

                nest_asyncio.apply()
            except ImportError:
                raise ImportError(
                    "It seems like your running this in a jupyter-like environment. Please install nest_asyncio with `pip install nest_asyncio` to make it work."
                )
        loop = asyncio.get_event_loop()
        score = loop.run_until_complete(self._ascore(row=row, callbacks=group_cm))
    except Exception as e:
        if not group_cm.ended:
            rm.on_chain_error(e)
        raise e
    else:
        if not group_cm.ended:
            rm.on_chain_end({"output": score})
    return score
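
Since `score` is deprecated, a minimal migration sketch is shown below: build a `SingleTurnSample` and call `single_turn_score` (or `single_turn_ascore`) on a concrete metric instead. The LLM wrapper and model here are illustrative assumptions, not prescribed by this API.

from langchain_openai import ChatOpenAI
from ragas.dataset_schema import SingleTurnSample
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import Faithfulness

# Wrap any LangChain chat model as a Ragas evaluator LLM (model choice is illustrative).
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
metric = Faithfulness(llm=evaluator_llm)

# Previously: metric.score({"user_input": ..., "response": ..., "retrieved_contexts": [...]})
sample = SingleTurnSample(
    user_input="Where is the Eiffel Tower?",
    response="The Eiffel Tower is in Paris.",
    retrieved_contexts=["The Eiffel Tower is located in Paris, France."],
)
print(metric.single_turn_score(sample))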

ascore async

ascore(
    row: Dict,
    callbacks: Callbacks = None,
    timeout: Optional[float] = None,
) -> float

Asynchronously calculates the score for a single row of data.

Note

This method is deprecated and will be removed in 0.3. Please use single_turn_ascore instead.

Source code in src/ragas/metrics/base.py
@deprecated("0.2", removal="0.3", alternative="single_turn_ascore")
async def ascore(
    self: t.Self,
    row: t.Dict,
    callbacks: Callbacks = None,
    timeout: t.Optional[float] = None,
) -> float:
    """
    Asynchronously calculates the score for a single row of data.

    Note
    ----
    This method is deprecated and will be removed in 0.3. Please use `single_turn_ascore` instead.
    """
    callbacks = callbacks or []
    rm, group_cm = new_group(self.name, inputs=row, callbacks=callbacks)
    try:
        score = await asyncio.wait_for(
            self._ascore(row=row, callbacks=group_cm),
            timeout=timeout,
        )
    except Exception as e:
        if not group_cm.ended:
            rm.on_chain_error(e)
        raise e
    else:
        if not group_cm.ended:
            rm.on_chain_end({"output": score})
    return score

MetricWithLLM dataclass

MetricWithLLM(
    _required_columns: Dict[MetricType, Set[str]] = dict(),
    llm: Optional[BaseRagasLLM] = None,
)

Bases: Metric, PromptMixin

A metric class that uses a language model for evaluation.

Attributes:

Name Type Description
llm Optional[BaseRagasLLM]

The language model used for the metric.

SingleTurnMetric dataclass

SingleTurnMetric(
    _required_columns: Dict[MetricType, Set[str]] = dict()
)

Bases: Metric

A metric class for evaluating single-turn interactions.

This class provides methods to score single-turn samples, both synchronously and asynchronously.

single_turn_score

single_turn_score(
    sample: SingleTurnSample, callbacks: Callbacks = None
) -> float

Synchronously score a single-turn sample.

May raise ImportError if nest_asyncio is not installed in a Jupyter-like environment.

Source code in src/ragas/metrics/base.py
def single_turn_score(
    self,
    sample: SingleTurnSample,
    callbacks: Callbacks = None,
) -> float:
    """
    Synchronously score a single-turn sample.

    May raise ImportError if nest_asyncio is not installed in a Jupyter-like environment.
    """
    callbacks = callbacks or []
    rm, group_cm = new_group(
        self.name, inputs=sample.model_dump(), callbacks=callbacks
    )
    try:
        if is_event_loop_running():
            try:
                import nest_asyncio

                nest_asyncio.apply()
            except ImportError:
                raise ImportError(
                    "It seems like your running this in a jupyter-like environment. Please install nest_asyncio with `pip install nest_asyncio` to make it work."
                )
        loop = asyncio.get_event_loop()
        score = loop.run_until_complete(
            self._single_turn_ascore(sample=sample, callbacks=group_cm)
        )
    except Exception as e:
        if not group_cm.ended:
            rm.on_chain_error(e)
        raise e
    else:
        if not group_cm.ended:
            rm.on_chain_end({"output": score})
    return score

single_turn_ascore async

single_turn_ascore(
    sample: SingleTurnSample,
    callbacks: Callbacks = None,
    timeout: Optional[float] = None,
) -> float

Asynchronously score a single-turn sample with an optional timeout.

May raise asyncio.TimeoutError if the scoring process exceeds the specified timeout.

Source code in src/ragas/metrics/base.py
async def single_turn_ascore(
    self,
    sample: SingleTurnSample,
    callbacks: Callbacks = None,
    timeout: t.Optional[float] = None,
) -> float:
    """
    Asynchronously score a single-turn sample with an optional timeout.

    May raise asyncio.TimeoutError if the scoring process exceeds the specified timeout.
    """
    callbacks = callbacks or []
    row = sample.model_dump()
    rm, group_cm = new_group(self.name, inputs=row, callbacks=callbacks)
    try:
        score = await asyncio.wait_for(
            self._single_turn_ascore(sample=sample, callbacks=group_cm),
            timeout=timeout,
        )
    except Exception as e:
        if not group_cm.ended:
            rm.on_chain_error(e)
        raise e
    else:
        if not group_cm.ended:
            rm.on_chain_end({"output": score})
    return score
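
A minimal async sketch with a timeout, assuming a plain script context (in notebooks, use the already-running event loop or `nest_asyncio` as noted above); the evaluator LLM wrapper and model are illustrative assumptions.

import asyncio

from langchain_openai import ChatOpenAI
from ragas.dataset_schema import SingleTurnSample
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import Faithfulness

async def main() -> float:
    metric = Faithfulness(llm=LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini")))
    sample = SingleTurnSample(
        user_input="Who wrote Hamlet?",
        response="Hamlet was written by William Shakespeare.",
        retrieved_contexts=["Hamlet is a tragedy written by William Shakespeare."],
    )
    # Raises asyncio.TimeoutError if scoring takes longer than 30 seconds.
    return await metric.single_turn_ascore(sample, timeout=30.0)

print(asyncio.run(main()))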

MultiTurnMetric dataclass

MultiTurnMetric(
    _required_columns: Dict[MetricType, Set[str]] = dict()
)

Bases: Metric

A metric class for evaluating multi-turn conversations.

This class extends the base Metric class to provide functionality for scoring multi-turn conversation samples.

multi_turn_score

multi_turn_score(
    sample: MultiTurnSample, callbacks: Callbacks = None
) -> float

Score a multi-turn conversation sample synchronously.

May raise ImportError if nest_asyncio is not installed in Jupyter-like environments.

Source code in src/ragas/metrics/base.py
def multi_turn_score(
    self,
    sample: MultiTurnSample,
    callbacks: Callbacks = None,
) -> float:
    """
    Score a multi-turn conversation sample synchronously.

    May raise ImportError if nest_asyncio is not installed in Jupyter-like environments.
    """
    callbacks = callbacks or []
    rm, group_cm = new_group(
        self.name, inputs=sample.model_dump(), callbacks=callbacks
    )
    try:
        if is_event_loop_running():
            try:
                import nest_asyncio

                nest_asyncio.apply()
            except ImportError:
                raise ImportError(
                    "It seems like your running this in a jupyter-like environment. Please install nest_asyncio with `pip install nest_asyncio` to make it work."
                )
        loop = asyncio.get_event_loop()
        score = loop.run_until_complete(
            self._multi_turn_ascore(sample=sample, callbacks=group_cm)
        )
    except Exception as e:
        if not group_cm.ended:
            rm.on_chain_error(e)
        raise e
    else:
        if not group_cm.ended:
            rm.on_chain_end({"output": score})
    return score

multi_turn_ascore async

multi_turn_ascore(
    sample: MultiTurnSample,
    callbacks: Callbacks = None,
    timeout: Optional[float] = None,
) -> float

Score a multi-turn conversation sample asynchronously.

May raise asyncio.TimeoutError if the scoring process exceeds the specified timeout.

Source code in src/ragas/metrics/base.py
async def multi_turn_ascore(
    self,
    sample: MultiTurnSample,
    callbacks: Callbacks = None,
    timeout: t.Optional[float] = None,
) -> float:
    """
    Score a multi-turn conversation sample asynchronously.

    May raise asyncio.TimeoutError if the scoring process exceeds the specified timeout.
    """
    callbacks = callbacks or []
    rm, group_cm = new_group(
        self.name, inputs=sample.model_dump(), callbacks=callbacks
    )
    try:
        score = await asyncio.wait_for(
            self._multi_turn_ascore(sample=sample, callbacks=group_cm),
            timeout=timeout,
        )
    except Exception as e:
        if not group_cm.ended:
            rm.on_chain_error(e)
        raise e
    else:
        if not group_cm.ended:
            rm.on_chain_end({"output": score})
    return score
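
A hedged multi-turn sketch using AspectCritic (documented further down this page), which also supports multi-turn scoring; the conversation and evaluator LLM are illustrative assumptions.

import asyncio

from langchain_openai import ChatOpenAI
from ragas.dataset_schema import MultiTurnSample
from ragas.llms import LangchainLLMWrapper
from ragas.messages import AIMessage, HumanMessage
from ragas.metrics import AspectCritic

async def main() -> float:
    metric = AspectCritic(
        name="helpfulness",
        definition="Does the assistant resolve the user's request?",
        llm=LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini")),
    )
    sample = MultiTurnSample(
        user_input=[
            HumanMessage(content="Book me a table for two tonight."),
            AIMessage(content="Done, your table is booked for 7 pm."),
        ]
    )
    return await metric.multi_turn_ascore(sample, timeout=60.0)

print(asyncio.run(main()))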

Ensember

Combine multiple LLM outputs for the same input (n > 1) into a single output.

from_discrete

from_discrete(
    inputs: list[list[Dict]], attribute: str
) -> List[Dict]

Simple majority voting for binary values, e.g. [0, 0, 1] -> 0. inputs: a list of lists of dicts, each containing the verdict for a single input.

Source code in src/ragas/metrics/base.py
def from_discrete(
    self, inputs: list[list[t.Dict]], attribute: str
) -> t.List[t.Dict]:
    """
    Simple majority voting for binary values, ie [0,0,1] -> 0
    inputs: list of list of dicts each containing verdict for a single input
    """

    if not isinstance(inputs, list):
        inputs = [inputs]

    if not all(len(item) == len(inputs[0]) for item in inputs):
        logger.warning("All inputs must have the same length")
        return inputs[0]

    if not all(attribute in item for input in inputs for item in input):
        logger.warning(f"All inputs must have {attribute} attribute")
        return inputs[0]

    if len(inputs) == 1:
        return inputs[0]

    verdict_agg = []
    for i in range(len(inputs[0])):
        item = inputs[0][i]
        verdicts = [inputs[k][i][attribute] for k in range(len(inputs))]
        verdict_counts = dict(Counter(verdicts).most_common())
        item[attribute] = list(verdict_counts.keys())[0]
        verdict_agg.append(item)

    return verdict_agg
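
A small sketch of how the majority vote works, assuming `Ensember` is imported from `ragas.metrics.base` (the verdict dicts are illustrative).

from ragas.metrics.base import Ensember

# Three LLM runs (n=3) over the same two statements; each run yields one verdict per statement.
runs = [
    [{"statement": "s1", "verdict": 1}, {"statement": "s2", "verdict": 0}],
    [{"statement": "s1", "verdict": 1}, {"statement": "s2", "verdict": 1}],
    [{"statement": "s1", "verdict": 0}, {"statement": "s2", "verdict": 1}],
]

# Majority vote per statement: s1 -> 1 (2 of 3 runs), s2 -> 1 (2 of 3 runs).
print(Ensember().from_discrete(runs, "verdict"))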

get_segmenter

get_segmenter(
    language: str = "english",
    clean: bool = False,
    char_span: bool = False,
)

Get a sentence segmenter for a given language

Source code in src/ragas/metrics/base.py
def get_segmenter(
    language: str = "english", clean: bool = False, char_span: bool = False
):
    """
    Get a sentence segmenter for a given language
    """
    language = language.lower()
    if language not in RAGAS_SUPPORTED_LANGUAGE_CODES:
        raise ValueError(
            f"Language '{language}' not supported. Supported languages: {RAGAS_SUPPORTED_LANGUAGE_CODES.keys()}"
        )
    return Segmenter(
        language=RAGAS_SUPPORTED_LANGUAGE_CODES[language],
        clean=clean,
        char_span=char_span,
    )
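
A quick sketch, assuming `get_segmenter` is imported from `ragas.metrics.base`; the returned segmenter exposes a `segment` method.

from ragas.metrics.base import get_segmenter

# Build a sentence segmenter for English.
seg = get_segmenter(language="english", clean=False)
print(seg.segment("Ragas computes metrics. Each sentence is scored separately."))
# A list of two sentences (exact whitespace handling may vary).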

is_reproducable

is_reproducable(metric: Metric) -> bool

Check whether a metric is reproducible, i.e. whether it has a _reproducibility attribute.

Source code in src/ragas/metrics/base.py
def is_reproducable(metric: Metric) -> bool:
    """
    Check if a metric is reproducible by checking if it has a `_reproducibility` attribute.
    """
    return hasattr(metric, "_reproducibility")

AnswerCorrectness dataclass

AnswerCorrectness(
    _required_columns: Dict[
        MetricType, Set[str]
    ] = lambda: {
        SINGLE_TURN: {"user_input", "response", "reference"}
    }(),
    embeddings: Optional[BaseRagasEmbeddings] = None,
    llm: Optional[BaseRagasLLM] = None,
    name: str = "answer_correctness",
    correctness_prompt: PydanticPrompt = CorrectnessClassifier(),
    long_form_answer_prompt: PydanticPrompt = LongFormAnswerPrompt(),
    weights: list[float] = lambda: [0.75, 0.25](),
    answer_similarity: Optional[AnswerSimilarity] = None,
    sentence_segmenter: Optional[HasSegmentMethod] = None,
    max_retries: int = 1,
)

Bases: MetricWithLLM, MetricWithEmbeddings, SingleTurnMetric

Measures answer correctness compared to ground truth as a combination of factuality and semantic similarity.

Attributes:

Name Type Description
name string

The name of the metric.

weights list[float]

A list of two weights corresponding to factuality and semantic similarity. Defaults to [0.75, 0.25].

answer_similarity Optional[AnswerSimilarity]

The AnswerSimilarity object
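
A hedged usage sketch: AnswerCorrectness needs both an evaluator LLM (for factuality) and embeddings (for semantic similarity). The wrappers and models below are illustrative assumptions.

import asyncio

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas.dataset_schema import SingleTurnSample
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import AnswerCorrectness

metric = AnswerCorrectness(
    llm=LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini")),
    embeddings=LangchainEmbeddingsWrapper(OpenAIEmbeddings()),
    weights=[0.75, 0.25],  # factuality weight, semantic-similarity weight
)
sample = SingleTurnSample(
    user_input="When was the first Moon landing?",
    response="The first Moon landing was in 1969.",
    reference="Apollo 11 landed on the Moon on 20 July 1969.",
)
print(asyncio.run(metric.single_turn_ascore(sample)))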

ResponseRelevancy dataclass

ResponseRelevancy(
    _required_columns: Dict[
        MetricType, Set[str]
    ] = lambda: {SINGLE_TURN: {"user_input", "response"}}(),
    embeddings: Optional[BaseRagasEmbeddings] = None,
    llm: Optional[BaseRagasLLM] = None,
    name: str = "answer_relevancy",
    question_generation: PydanticPrompt = ResponseRelevancePrompt(),
    strictness: int = 3,
)

Bases: MetricWithLLM, MetricWithEmbeddings, SingleTurnMetric

Scores the relevancy of the answer according to the given question. Answers with incomplete, redundant or unnecessary information are penalized. The score ranges from 0 to 1, with 1 being the best.

Attributes:

Name Type Description
name string

The name of the metric.

strictness int

The number of questions generated per answer. The ideal range is 3 to 5.

embeddings Embedding

The LangChain wrapper of the embedding object, e.g. HuggingFaceEmbeddings('BAAI/bge-base-en').
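
A minimal sketch: ResponseRelevancy uses the LLM to generate questions from the answer and embeddings to compare them with the original question. The wrappers and models are illustrative assumptions.

import asyncio

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas.dataset_schema import SingleTurnSample
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import ResponseRelevancy

metric = ResponseRelevancy(
    llm=LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini")),
    embeddings=LangchainEmbeddingsWrapper(OpenAIEmbeddings()),
    strictness=3,  # number of questions generated per answer
)
sample = SingleTurnSample(
    user_input="What is the capital of France?",
    response="Paris is the capital of France.",
)
print(asyncio.run(metric.single_turn_ascore(sample)))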

SemanticSimilarity dataclass

SemanticSimilarity(
    _required_columns: Dict[
        MetricType, Set[str]
    ] = lambda: {SINGLE_TURN: {"reference", "response"}}(),
    embeddings: Optional[BaseRagasEmbeddings] = None,
    llm: Optional[BaseRagasLLM] = None,
    name: str = "semantic_similarity",
    is_cross_encoder: bool = False,
    threshold: Optional[float] = None,
)

Bases: MetricWithLLM, MetricWithEmbeddings, SingleTurnMetric

Scores the semantic similarity of the ground truth with the generated answer. A cross-encoder score is used to quantify semantic similarity. SAS paper: https://arxiv.org/pdf/2108.06130.pdf

Attributes:

Name Type Description
name str
model_name

The model used to calculate semantic similarity. Defaults to OpenAI embeddings; select a cross-encoder model for best results (https://huggingface.co/spaces/mteb/leaderboard).

threshold Optional[float]

If given, the threshold used to map the output to a binary value. Default 0.5.
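
A minimal sketch: only embeddings are required, and `threshold` optionally maps the score to a binary value. The wrapper and model are illustrative assumptions.

import asyncio

from langchain_openai import OpenAIEmbeddings
from ragas.dataset_schema import SingleTurnSample
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.metrics import SemanticSimilarity

metric = SemanticSimilarity(
    embeddings=LangchainEmbeddingsWrapper(OpenAIEmbeddings()),
    threshold=0.8,  # optional; if set, the output is mapped to 0/1 against this threshold
)
sample = SingleTurnSample(
    response="The capital of France is Paris.",
    reference="Paris is France's capital city.",
)
print(asyncio.run(metric.single_turn_ascore(sample)))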

AspectCritic dataclass

AspectCritic(
    _required_columns: Dict[
        MetricType, Set[str]
    ] = lambda: {SINGLE_TURN: {"user_input", "response"}}(),
    llm: Optional[BaseRagasLLM] = None,
    name: str = "",
    single_turn_prompt: PydanticPrompt = lambda: SingleTurnAspectCriticPrompt()(),
    multi_turn_prompt: PydanticPrompt = lambda: MultiTurnAspectCriticPrompt()(),
    definition: str = "",
    strictness: int = 1,
    max_retries: int = 1,
)

Bases: MetricWithLLM, SingleTurnMetric, MultiTurnMetric

Judges the submission to give binary results using the criteria specified in the metric definition.

Attributes:

Name Type Description
name str

The name of the metric.

definition str

The criteria used to judge the submission, e.g. "Is the submission spreading fake information?"

strictness int

The number of times the self-consistency check is made. The final judgement is made by majority vote.
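
A hedged sketch of a custom binary critic; the `definition` text, sample data, and evaluator LLM are illustrative assumptions.

import asyncio

from langchain_openai import ChatOpenAI
from ragas.dataset_schema import SingleTurnSample
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import AspectCritic

metric = AspectCritic(
    name="harmfulness",
    definition="Does the submission cause or risk harm to people, property, or the environment?",
    llm=LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini")),
    strictness=3,  # odd values avoid ties in the majority vote
)
sample = SingleTurnSample(
    user_input="How do I reset my router?",
    response="Hold the reset button for 10 seconds, then wait for the lights to stabilise.",
)
print(asyncio.run(metric.single_turn_ascore(sample)))  # 1 or 0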

ContextEntityRecall dataclass

ContextEntityRecall(
    _required_columns: Dict[
        MetricType, Set[str]
    ] = lambda: {
        SINGLE_TURN: {"reference", "retrieved_contexts"}
    }(),
    llm: Optional[BaseRagasLLM] = None,
    name: str = "context_entity_recall",
    context_entity_recall_prompt: PydanticPrompt = ExtractEntitiesPrompt(),
    max_retries: int = 1,
)

Bases: MetricWithLLM, SingleTurnMetric

Calculates recall based on entities present in the ground truth and in the context. Let CN be the set of entities present in the context and GN be the set of entities present in the ground truth.

We then define context entity recall as: Context Entity Recall = |CN ∩ GN| / |GN|

If this quantity is 1, the retrieval mechanism has retrieved context that covers all entities present in the ground truth, making it a useful retrieval. This metric can therefore be used to evaluate retrieval in use cases where entities matter, for example a tourism help chatbot.
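
For intuition, a tiny worked example of the formula with plain Python sets; the entities are illustrative, and in the metric itself they are extracted by the LLM.

# Entities extracted from the ground truth (GN) and from the retrieved context (CN).
gn = {"Eiffel Tower", "Paris", "1889"}
cn = {"Eiffel Tower", "Paris", "France"}

# Context Entity Recall = |CN ∩ GN| / |GN| = 2 / 3 ≈ 0.67
recall = len(cn & gn) / len(gn)
print(round(recall, 2))  # 0.67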

Attributes:

Name Type Description
name str
batch_size int

Batch size for the OpenAI completion.

LLMContextRecall dataclass

LLMContextRecall(
    _required_columns: Dict[
        MetricType, Set[str]
    ] = lambda: {
        SINGLE_TURN: {
            "user_input",
            "retrieved_contexts",
            "reference",
        }
    }(),
    llm: Optional[BaseRagasLLM] = None,
    name: str = "context_recall",
    context_recall_prompt: PydanticPrompt = ContextRecallClassificationPrompt(),
    max_retries: int = 1,
    _reproducibility: int = 1,
)

Bases: MetricWithLLM, SingleTurnMetric

Estimates context recall by estimating TP and FN using the annotated answer and the retrieved context.

Attributes:

Name Type Description
name str

The name of the metric.
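
A minimal sketch: LLMContextRecall needs the user input, the retrieved contexts, and a reference answer. The evaluator LLM wrapper and model are illustrative assumptions.

import asyncio

from langchain_openai import ChatOpenAI
from ragas.dataset_schema import SingleTurnSample
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import LLMContextRecall

metric = LLMContextRecall(llm=LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini")))
sample = SingleTurnSample(
    user_input="Who painted the Mona Lisa?",
    retrieved_contexts=["The Mona Lisa was painted by Leonardo da Vinci."],
    reference="Leonardo da Vinci painted the Mona Lisa.",
)
print(asyncio.run(metric.single_turn_ascore(sample)))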

Faithfulness dataclass

Faithfulness(
    _required_columns: Dict[
        MetricType, Set[str]
    ] = lambda: {
        SINGLE_TURN: {
            "user_input",
            "response",
            "retrieved_contexts",
        }
    }(),
    llm: Optional[BaseRagasLLM] = None,
    name: str = "faithfulness",
    nli_statements_message: PydanticPrompt = NLIStatementPrompt(),
    statement_prompt: PydanticPrompt = LongFormAnswerPrompt(),
    sentence_segmenter: Optional[HasSegmentMethod] = None,
    max_retries: int = 1,
    _reproducibility: int = 1,
)

FaithfulnesswithHHEM dataclass

FaithfulnesswithHHEM(
    _required_columns: Dict[
        MetricType, Set[str]
    ] = lambda: {
        SINGLE_TURN: {
            "user_input",
            "response",
            "retrieved_contexts",
        }
    }(),
    llm: Optional[BaseRagasLLM] = None,
    name: str = "faithfulness_with_hhem",
    nli_statements_message: PydanticPrompt = NLIStatementPrompt(),
    statement_prompt: PydanticPrompt = LongFormAnswerPrompt(),
    sentence_segmenter: Optional[HasSegmentMethod] = None,
    max_retries: int = 1,
    _reproducibility: int = 1,
    device: str = "cpu",
    batch_size: int = 10,
)

Bases: Faithfulness

NoiseSensitivity dataclass

NoiseSensitivity(
    _required_columns: Dict[
        MetricType, Set[str]
    ] = lambda: {
        SINGLE_TURN: {
            "user_input",
            "response",
            "reference",
            "retrieved_contexts",
        }
    }(),
    llm: Optional[BaseRagasLLM] = None,
    name: str = "noise_sensitivity",
    focus: Literal["relevant", "irrelevant"] = "relevant",
    nli_statements_message: PydanticPrompt = NLIStatementPrompt(),
    statement_prompt: PydanticPrompt = LongFormAnswerPrompt(),
    sentence_segmenter: Optional[HasSegmentMethod] = None,
    max_retries: int = 1,
    _reproducibility: int = 1,
)
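
A hedged sketch: `focus` selects whether sensitivity to relevant or irrelevant retrieved context is measured. The evaluator LLM and the sample data are illustrative assumptions.

import asyncio

from langchain_openai import ChatOpenAI
from ragas.dataset_schema import SingleTurnSample
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import NoiseSensitivity

metric = NoiseSensitivity(
    llm=LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini")),
    focus="irrelevant",  # or "relevant" (the default)
)
sample = SingleTurnSample(
    user_input="What year did the Berlin Wall fall?",
    response="The Berlin Wall fell in 1989, and it was 155 km long.",
    reference="The Berlin Wall fell in 1989.",
    retrieved_contexts=[
        "The Berlin Wall fell on 9 November 1989.",
        "The Berlin Wall was about 155 km long.",
    ],
)
print(asyncio.run(metric.single_turn_ascore(sample)))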