LLM Evaluation Examples

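The notebooks listed below contain step-by-step tutorials on how to evaluate LLMs with MLflow. The first notebook focuses on evaluating an LLM for question answering using a prompt-engineering approach. The second notebook focuses on evaluating a RAG system. Both notebooks demonstrate MLflow's built-in metrics, such as token_count and toxicity, as well as LLM-judged metrics, such as answer_relevance. The third notebook is the same as the second, but uses llama2-70b hosted on Databricks as the judge instead of gpt-4.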
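As a rough illustration of the workflow these notebooks walk through, the sketch below scores a small table of pre-computed answers with two built-in metrics and one LLM-judged metric. It is not taken from the notebooks: the sample rows, the helper names `build_eval_records` and `run_evaluation`, and the Databricks endpoint name are assumptions made for illustration. It assumes MLflow 2.8 or later with pandas installed. `token_count` and `toxicity` may need extra packages (such as tiktoken, evaluate, torch, and transformers), and the judge metric needs credentials for the judge model.

```python
# Minimal sketch: evaluate a static table of answers with MLflow LLM metrics.
# Assumes MLflow 2.8+ and pandas; the judge metric needs credentials for the
# judge model (for example OPENAI_API_KEY when the judge is gpt-4).

# Judge model URIs. The Databricks endpoint name is an assumption; replace it
# with the name of a serving endpoint that exists in your workspace.
JUDGE_URIS = {
    "gpt-4": "openai:/gpt-4",
    "llama2-70b": "endpoints:/databricks-llama-2-70b-chat",
}


def build_eval_records():
    """Return illustrative (made-up) question/answer rows."""
    return [
        {
            "inputs": "What is MLflow?",
            "predictions": "MLflow is an open-source platform for managing the ML lifecycle.",
            "ground_truth": "MLflow is an open-source platform for managing the machine learning lifecycle.",
        },
        {
            "inputs": "What is RAG?",
            "predictions": "RAG retrieves documents and passes them to an LLM as context.",
            "ground_truth": "Retrieval-augmented generation combines document retrieval with LLM generation.",
        },
    ]


def run_evaluation(records, judge="gpt-4"):
    """Evaluate pre-computed predictions and return the evaluation result."""
    # Imported lazily so the data helpers above work without MLflow installed.
    import mlflow
    import pandas as pd

    eval_df = pd.DataFrame(records)
    with mlflow.start_run():
        return mlflow.evaluate(
            data=eval_df,
            predictions="predictions",  # column holding the model outputs
            targets="ground_truth",     # column holding the reference answers
            extra_metrics=[
                mlflow.metrics.token_count(),
                mlflow.metrics.toxicity(),
                mlflow.metrics.genai.answer_relevance(model=JUDGE_URIS[judge]),
            ],
            evaluators="default",
        )
```

Calling `run_evaluation(build_eval_records())` returns a result whose `metrics` attribute holds the aggregate scores and whose `tables["eval_results_table"]` holds the per-row scores. Passing `judge="llama2-70b"` swaps the judge model without changing anything else, which mirrors the difference between the two RAG notebooks.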

QA Evaluation Tutorial

Learn how to evaluate various LLMs and RAG systems with MLflow, leveraging simple metrics such as toxicity, LLM-judged metrics such as relevance, and even custom LLM-judged metrics such as professionalism.

Learn how to evaluate various open-source LLMs available on Hugging Face, leveraging MLflow's built-in LLM metrics and experiment tracking to manage models and evaluation results.

RAG Evaluation Tutorial

Learn how to evaluate RAG systems with MLflow, leveraging the OpenAI GPT-4 model as a judge.

Learn how to evaluate RAG systems with MLflow, leveraging a Llama 2 70B model hosted on a Databricks serving endpoint as a judge.