Metrics
MetricType
Bases: Enum
Enumeration of metric types in Ragas.
Attributes:

| Name | Type | Description |
| --- | --- | --- |
| SINGLE_TURN | str | Represents a single-turn metric type. |
| MULTI_TURN | str | Represents a multi-turn metric type. |
Metric
dataclass
Metric(
_required_columns: Dict[MetricType, Set[str]] = dict()
)
Bases: ABC
Abstract base class for metrics in Ragas.
Attributes:

| Name | Type | Description |
| --- | --- | --- |
| name | str | The name of the metric. |
| required_columns | Dict[str, Set[str]] | A dictionary mapping metric type names to sets of required column names. This is a property and raises an error if the configured columns are not valid. |
score
Calculates the score for a single row of data.
Note
This method is deprecated and will be removed in 0.3. Please use single_turn_ascore or multi_turn_ascore instead.
Source code in src/ragas/metrics/base.py
ascore
async
Asynchronously calculates the score for a single row of data.
Note
This method is deprecated and will be removed in 0.3. Please use single_turn_ascore instead.
Source code in src/ragas/metrics/base.py
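To make the required_columns contract above concrete, here is a minimal sketch, assuming MetricType is importable from ragas.metrics.base (the source path shown above); the column sets are made up for illustration:

```python
from ragas.metrics.base import MetricType

# Hypothetical mapping shaped like the _required_columns field: MetricType members
# map to the set of dataset columns a metric needs for that evaluation mode.
required = {MetricType.SINGLE_TURN: {"user_input", "response"}}

# The public required_columns property exposes the same data keyed by the
# metric type name (a str), i.e. Dict[str, Set[str]].
by_name = {metric_type.name: columns for metric_type, columns in required.items()}
print(by_name)  # {'SINGLE_TURN': {'user_input', 'response'}}
```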
MetricWithLLM
dataclass
MetricWithLLM(
_required_columns: Dict[MetricType, Set[str]] = dict(),
llm: Optional[BaseRagasLLM] = None,
)
Bases: Metric, PromptMixin
A metric class that uses a language model for evaluation.
Attributes:

| Name | Type | Description |
| --- | --- | --- |
| llm | Optional[BaseRagasLLM] | The language model used for the metric. |
SingleTurnMetric
dataclass
SingleTurnMetric(
_required_columns: Dict[MetricType, Set[str]] = dict()
)
Bases: Metric
A metric class for evaluating single-turn interactions.
This class provides methods to score single-turn samples, both synchronously and asynchronously.
single_turn_score
single_turn_score(
sample: SingleTurnSample, callbacks: Callbacks = None
) -> float
Synchronously score a single-turn sample.
May raise ImportError if nest_asyncio is not installed in a Jupyter-like environment.
Source code in src/ragas/metrics/base.py
single_turn_ascore
async
single_turn_ascore(
sample: SingleTurnSample,
callbacks: Callbacks = None,
timeout: Optional[float] = None,
) -> float
Asynchronously score a single-turn sample with an optional timeout.
May raise asyncio.TimeoutError if the scoring process exceeds the specified timeout.
Source code in src/ragas/metrics/base.py
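As a usage sketch, the snippet below scores a SingleTurnSample both synchronously and asynchronously, using Faithfulness (documented later on this page) as a concrete SingleTurnMetric. The ChatOpenAI model and the LangchainLLMWrapper are assumptions about the evaluator setup; substitute whatever BaseRagasLLM you use.

```python
import asyncio

from langchain_openai import ChatOpenAI
from ragas.dataset_schema import SingleTurnSample
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import Faithfulness

# Assumption: an OpenAI chat model wrapped as a BaseRagasLLM.
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))

sample = SingleTurnSample(
    user_input="When was Einstein born?",
    response="Albert Einstein was born on 14 March 1879.",
    retrieved_contexts=["Albert Einstein (14 March 1879 - 18 April 1955) was a physicist."],
)

metric = Faithfulness(llm=evaluator_llm)

# Synchronous entry point (may need nest_asyncio in notebook environments, see above).
print(metric.single_turn_score(sample))

# Asynchronous entry point with an optional timeout in seconds.
print(asyncio.run(metric.single_turn_ascore(sample, timeout=60)))
```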
MultiTurnMetric
dataclass
MultiTurnMetric(
_required_columns: Dict[MetricType, Set[str]] = dict()
)
Bases: Metric
A metric class for evaluating multi-turn conversations.
This class extends the base Metric class to provide functionality for scoring multi-turn conversation samples.
multi_turn_score
multi_turn_score(
sample: MultiTurnSample, callbacks: Callbacks = None
) -> float
Score a multi-turn conversation sample synchronously.
May raise ImportError if nest_asyncio is not installed in Jupyter-like environments.
Source code in src/ragas/metrics/base.py
multi_turn_ascore
async
multi_turn_ascore(
sample: MultiTurnSample,
callbacks: Callbacks = None,
timeout: Optional[float] = None,
) -> float
Score a multi-turn conversation sample asynchronously.
May raise asyncio.TimeoutError if the scoring process exceeds the specified timeout.
Source code in src/ragas/metrics/base.py
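A comparable sketch for multi-turn scoring, using AspectCritic (documented below; it is both a SingleTurnMetric and a MultiTurnMetric). The ragas.messages import path, the LLM wrapper, and the sample contents are assumptions; adjust to your setup.

```python
import asyncio

from langchain_openai import ChatOpenAI
from ragas.dataset_schema import MultiTurnSample
from ragas.llms import LangchainLLMWrapper
from ragas.messages import AIMessage, HumanMessage
from ragas.metrics import AspectCritic

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))

# A two-turn conversation; MultiTurnSample takes the message list as user_input.
sample = MultiTurnSample(
    user_input=[
        HumanMessage(content="Book a table for two at 7pm."),
        AIMessage(content="Done. I booked a table for two at 7pm."),
    ]
)

metric = AspectCritic(
    name="request_resolved",
    definition="Does the assistant fully resolve the user's request?",
    llm=evaluator_llm,
)

print(metric.multi_turn_score(sample))
print(asyncio.run(metric.multi_turn_ascore(sample, timeout=60)))
```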
Ensember
Combines multiple LLM outputs for the same input (n > 1) into a single output.
from_discrete
Simple majority voting for binary values, e.g. [0, 0, 1] -> 0. Input: a list of lists of dicts, each containing the verdict for a single input.
Source code in src/ragas/metrics/base.py
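To illustrate the idea behind from_discrete, here is a plain sketch of majority voting over binary verdicts (not the actual ragas implementation or its signature):

```python
from collections import Counter
from typing import List

def majority_vote(verdicts: List[int]) -> int:
    """Collapse binary verdicts from n > 1 LLM calls into one value, e.g. [0, 0, 1] -> 0."""
    return Counter(verdicts).most_common(1)[0][0]

print(majority_vote([0, 0, 1]))  # 0
print(majority_vote([1, 0, 1]))  # 1
```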
get_segmenter
Get a sentence segmenter for a given language
Source code in src/ragas/metrics/base.py
is_reproducable
is_reproducable(metric: Metric) -> bool
Check if a metric is reproducible by checking whether it has a _reproducibility attribute.
AnswerCorrectness
dataclass
AnswerCorrectness(
_required_columns: Dict[
MetricType, Set[str]
] = lambda: {
SINGLE_TURN: {"user_input", "response", "reference"}
}(),
embeddings: Optional[BaseRagasEmbeddings] = None,
llm: Optional[BaseRagasLLM] = None,
name: str = "answer_correctness",
correctness_prompt: PydanticPrompt = CorrectnessClassifier(),
long_form_answer_prompt: PydanticPrompt = LongFormAnswerPrompt(),
weights: list[float] = lambda: [0.75, 0.25](),
answer_similarity: Optional[AnswerSimilarity] = None,
sentence_segmenter: Optional[HasSegmentMethod] = None,
max_retries: int = 1,
)
Bases: MetricWithLLM, MetricWithEmbeddings, SingleTurnMetric
Measures answer correctness compared to ground truth as a combination of factuality and semantic similarity.
Attributes:

| Name | Type | Description |
| --- | --- | --- |
| name | str | The name of the metric. |
| weights | list[float] | A list of two weights corresponding to factuality and semantic similarity. Defaults to [0.75, 0.25]. |
| answer_similarity | Optional[AnswerSimilarity] | The AnswerSimilarity object. |
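A usage sketch with custom weights, shifting some weight from factuality to semantic similarity. The LLM/embedding wrappers and model names are assumptions; the required columns (user_input, response, reference) follow the signature above.

```python
import asyncio

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas.dataset_schema import SingleTurnSample
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import AnswerCorrectness

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
evaluator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

# Weight factuality at 0.6 and semantic similarity at 0.4 instead of the default [0.75, 0.25].
metric = AnswerCorrectness(
    llm=evaluator_llm,
    embeddings=evaluator_embeddings,
    weights=[0.6, 0.4],
)

sample = SingleTurnSample(
    user_input="Where is the Eiffel Tower?",
    response="The Eiffel Tower is in Paris.",
    reference="The Eiffel Tower is located in Paris, France.",
)
print(asyncio.run(metric.single_turn_ascore(sample)))
```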
ResponseRelevancy
dataclass
ResponseRelevancy(
_required_columns: Dict[
MetricType, Set[str]
] = lambda: {SINGLE_TURN: {"user_input", "response"}}(),
embeddings: Optional[BaseRagasEmbeddings] = None,
llm: Optional[BaseRagasLLM] = None,
name: str = "answer_relevancy",
question_generation: PydanticPrompt = ResponseRelevancePrompt(),
strictness: int = 3,
)
Bases: MetricWithLLM, MetricWithEmbeddings, SingleTurnMetric
Scores the relevancy of the answer according to the given question. Answers with incomplete, redundant, or unnecessary information are penalized. The score ranges from 0 to 1, with 1 being the best.
Attributes:

| Name | Type | Description |
| --- | --- | --- |
| name | str | The name of the metric. |
| strictness | int | The number of questions generated per answer. The ideal range is 3 to 5. |
| embeddings | Embedding | The LangChain wrapper of the embedding object, e.g. HuggingFaceEmbeddings('BAAI/bge-base-en'). |
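A usage sketch showing the strictness knob; the wrappers and model names are assumptions, as before.

```python
import asyncio

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas.dataset_schema import SingleTurnSample
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import ResponseRelevancy

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
evaluator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

# strictness = number of questions generated per answer (3-5 suggested above).
metric = ResponseRelevancy(
    llm=evaluator_llm,
    embeddings=evaluator_embeddings,
    strictness=3,
)

sample = SingleTurnSample(
    user_input="What causes tides?",
    response="Tides are mainly caused by the gravitational pull of the Moon and the Sun.",
)
print(asyncio.run(metric.single_turn_ascore(sample)))
```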
SemanticSimilarity
dataclass
SemanticSimilarity(
_required_columns: Dict[
MetricType, Set[str]
] = lambda: {SINGLE_TURN: {"reference", "response"}}(),
embeddings: Optional[BaseRagasEmbeddings] = None,
llm: Optional[BaseRagasLLM] = None,
name: str = "semantic_similarity",
is_cross_encoder: bool = False,
threshold: Optional[float] = None,
)
Bases: MetricWithLLM, MetricWithEmbeddings, SingleTurnMetric
Scores the semantic similarity of the ground truth with the generated answer. A cross-encoder score is used to quantify semantic similarity. SAS paper: https://arxiv.org/pdf/2108.06130.pdf
Attributes:

| Name | Type | Description |
| --- | --- | --- |
| name | str | The name of the metric. |
| model_name | | The model used to calculate semantic similarity. Defaults to OpenAI embeddings; select a cross-encoder model for best results (https://huggingface.co/spaces/mteb/leaderboard). |
| threshold | Optional[float] | If given, the threshold used to map the output to a binary value. Defaults to 0.5. |
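A usage sketch; only embeddings are needed here, and setting threshold maps the continuous score to a binary value. The embedding wrapper and model are assumptions.

```python
import asyncio

from langchain_openai import OpenAIEmbeddings
from ragas.dataset_schema import SingleTurnSample
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.metrics import SemanticSimilarity

evaluator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

# With a threshold, the continuous similarity score is mapped to 0/1.
metric = SemanticSimilarity(embeddings=evaluator_embeddings, threshold=0.8)

sample = SingleTurnSample(
    response="The capital of France is Paris.",
    reference="Paris is the capital of France.",
)
print(asyncio.run(metric.single_turn_ascore(sample)))
```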
AspectCritic
dataclass
AspectCritic(
_required_columns: Dict[
MetricType, Set[str]
] = lambda: {SINGLE_TURN: {"user_input", "response"}}(),
llm: Optional[BaseRagasLLM] = None,
name: str = "",
single_turn_prompt: PydanticPrompt = lambda: SingleTurnAspectCriticPrompt()(),
multi_turn_prompt: PydanticPrompt = lambda: MultiTurnAspectCriticPrompt()(),
definition: str = "",
strictness: int = 1,
max_retries: int = 1,
)
Bases: MetricWithLLM, SingleTurnMetric, MultiTurnMetric
Judges the submission to give binary results using the criteria specified in the metric definition.
Attributes:

| Name | Type | Description |
| --- | --- | --- |
| name | str | The name of the metric. |
| definition | str | The criteria used to judge the submission, e.g. "Is the submission spreading fake information?" |
| strictness | int | The number of self-consistency checks performed. The final judgement is made using a majority vote. |
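A single-turn usage sketch with a custom definition and strictness; the LLM wrapper, model name, and sample contents are assumptions.

```python
import asyncio

from langchain_openai import ChatOpenAI
from ragas.dataset_schema import SingleTurnSample
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import AspectCritic

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))

# strictness=3 runs three self-consistency checks; the majority vote is returned.
metric = AspectCritic(
    name="harmfulness",
    definition="Does the submission cause or have the potential to cause harm?",
    strictness=3,
    llm=evaluator_llm,
)

sample = SingleTurnSample(
    user_input="How do I reset my router?",
    response="Hold the reset button for about 10 seconds until the lights blink.",
)
print(asyncio.run(metric.single_turn_ascore(sample)))  # binary: 1 or 0
```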
ContextEntityRecall
dataclass
ContextEntityRecall(
_required_columns: Dict[
MetricType, Set[str]
] = lambda: {
SINGLE_TURN: {"reference", "retrieved_contexts"}
}(),
llm: Optional[BaseRagasLLM] = None,
name: str = "context_entity_recall",
context_entity_recall_prompt: PydanticPrompt = ExtractEntitiesPrompt(),
max_retries: int = 1,
)
Bases: MetricWithLLM, SingleTurnMetric
Calculates recall based on entities present in the ground truth and the retrieved context. Let CN be the set of entities present in the context and GN be the set of entities present in the ground truth.
Context entity recall is then defined as: Context Entity Recall = |CN ∩ GN| / |GN|
If this quantity is 1, the retrieval mechanism has retrieved context covering all entities present in the ground truth, making it a useful retrieval. This metric can therefore be used to evaluate retrieval mechanisms in use cases where entities matter, for example a tourism help chatbot.
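A small worked example of the formula, purely illustrative: the real metric extracts the entity sets with an LLM prompt, whereas here they are hard-coded.

```python
context_entities = {"Eiffel Tower", "Paris", "1889"}                 # CN
ground_truth_entities = {"Eiffel Tower", "Paris", "Gustave Eiffel"}  # GN

# Context Entity Recall = |CN ∩ GN| / |GN|
recall = len(context_entities & ground_truth_entities) / len(ground_truth_entities)
print(recall)  # 2/3 ≈ 0.67
```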
Attributes:

| Name | Type | Description |
| --- | --- | --- |
| name | str | The name of the metric. |
| batch_size | int | Batch size for OpenAI completions. |
LLMContextRecall
dataclass
LLMContextRecall(
_required_columns: Dict[
MetricType, Set[str]
] = lambda: {
SINGLE_TURN: {
"user_input",
"retrieved_contexts",
"reference",
}
}(),
llm: Optional[BaseRagasLLM] = None,
name: str = "context_recall",
context_recall_prompt: PydanticPrompt = ContextRecallClassificationPrompt(),
max_retries: int = 1,
_reproducibility: int = 1,
)
Bases: MetricWithLLM, SingleTurnMetric
Estimates context recall by estimating TP and FN using the annotated answer and the retrieved context.
Attributes:

| Name | Type | Description |
| --- | --- | --- |
| name | str | The name of the metric. |
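A usage sketch; the required columns (user_input, retrieved_contexts, reference) follow the signature above, and the LLM wrapper and model name are assumptions.

```python
import asyncio

from langchain_openai import ChatOpenAI
from ragas.dataset_schema import SingleTurnSample
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import LLMContextRecall

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
metric = LLMContextRecall(llm=evaluator_llm)

sample = SingleTurnSample(
    user_input="Who designed the Eiffel Tower?",
    retrieved_contexts=["The Eiffel Tower was designed by Gustave Eiffel's engineering company."],
    reference="Gustave Eiffel's company designed the Eiffel Tower.",
)
print(asyncio.run(metric.single_turn_ascore(sample)))
```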
Faithfulness
dataclass
Faithfulness(
_required_columns: Dict[
MetricType, Set[str]
] = lambda: {
SINGLE_TURN: {
"user_input",
"response",
"retrieved_contexts",
}
}(),
llm: Optional[BaseRagasLLM] = None,
name: str = "faithfulness",
nli_statements_message: PydanticPrompt = NLIStatementPrompt(),
statement_prompt: PydanticPrompt = LongFormAnswerPrompt(),
sentence_segmenter: Optional[HasSegmentMethod] = None,
max_retries: int = 1,
_reproducibility: int = 1,
)
Bases: MetricWithLLM, SingleTurnMetric
FaithfulnesswithHHEM
dataclass
FaithfulnesswithHHEM(
_required_columns: Dict[
MetricType, Set[str]
] = lambda: {
SINGLE_TURN: {
"user_input",
"response",
"retrieved_contexts",
}
}(),
llm: Optional[BaseRagasLLM] = None,
name: str = "faithfulness_with_hhem",
nli_statements_message: PydanticPrompt = NLIStatementPrompt(),
statement_prompt: PydanticPrompt = LongFormAnswerPrompt(),
sentence_segmenter: Optional[HasSegmentMethod] = None,
max_retries: int = 1,
_reproducibility: int = 1,
device: str = "cpu",
batch_size: int = 10,
)
Bases: Faithfulness
NoiseSensitivity
dataclass
NoiseSensitivity(
_required_columns: Dict[
MetricType, Set[str]
] = lambda: {
SINGLE_TURN: {
"user_input",
"response",
"reference",
"retrieved_contexts",
}
}(),
llm: Optional[BaseRagasLLM] = None,
name: str = "noise_sensitivity",
focus: Literal["relevant", "irrelevant"] = "relevant",
nli_statements_message: PydanticPrompt = NLIStatementPrompt(),
statement_prompt: PydanticPrompt = LongFormAnswerPrompt(),
sentence_segmenter: Optional[HasSegmentMethod] = None,
max_retries: int = 1,
_reproducibility: int = 1,
)
Bases: MetricWithLLM, SingleTurnMetric
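Finally, a usage sketch for NoiseSensitivity showing the focus parameter from the signature above; the LLM wrapper, model name, and sample contents are assumptions.

```python
import asyncio

from langchain_openai import ChatOpenAI
from ragas.dataset_schema import SingleTurnSample
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import NoiseSensitivity

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))

# focus="irrelevant" measures how much irrelevant retrieved chunks sway the response.
metric = NoiseSensitivity(llm=evaluator_llm, focus="irrelevant")

sample = SingleTurnSample(
    user_input="When did Apollo 11 land on the Moon?",
    response="Apollo 11 landed on the Moon on 20 July 1969.",
    reference="Apollo 11 landed on the Moon on 20 July 1969.",
    retrieved_contexts=[
        "Apollo 11 landed on the Moon on 20 July 1969.",
        "The Saturn V rocket was 110.6 metres tall.",  # irrelevant chunk
    ],
)
print(asyncio.run(metric.single_turn_ascore(sample)))
```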