
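Available settings

A set of options is available to customize the behaviour of ydata-profiling and the appearance of the generated report. This depth of customization allows highly tailored behaviour for the specific dataset being analysed. The available settings are listed below. To learn how to change them, see :doc:changing_settings.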

General settings

Global report settings:

| Parameter | Type | Default | Description |
|---|---|---|---|
| title | string | Pandas Profiling Report | Title for the report, shown in the header and title bar. |
| pool_size | integer | 0 | Number of workers in the thread pool. When set to zero, the number of available CPUs is used. |
| progress_bar | boolean | True | If True, ydata-profiling displays a progress bar. |
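
The pool_size rule above (zero means "use every available CPU") can be pictured with a small sketch. The helper name `resolve_pool_size` is hypothetical and not part of ydata-profiling; it only illustrates the rule under the assumption that the CPU count comes from the standard library.

```python
import os


def resolve_pool_size(pool_size: int) -> int:
    """Return the worker count implied by a pool_size setting.

    Hypothetical helper for illustration only: 0 means "use every CPU".
    """
    if pool_size < 0:
        raise ValueError("pool_size must be zero or a positive integer")
    if pool_size == 0:
        # os.cpu_count() can return None on some platforms; fall back to 1.
        return os.cpu_count() or 1
    return pool_size


print(resolve_pool_size(4))  # an explicit value is used as-is
print(resolve_pool_size(0))  # zero resolves to the machine's CPU count
```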

Variable summary settings

Settings related to the information displayed for each variable.

| Parameter | Type | Default | Description |
|---|---|---|---|
| sort | None, asc or desc | None | Sort the variables asc(ending), desc(ending) or None (leaves original sorting). |
| variables.descriptions | dict | {} | Display a description alongside the descriptive statistics of each variable ({'var_name': 'Description'}). |
| vars.num.quantiles | list[float] | [0.05,0.25,0.5,0.75,0.95] | The quantiles to calculate. Note that .25, .5 and .75 are required for the computation of other metrics (median and IQR). |
| vars.num.skewness_threshold | integer | 20 | Warn if the skewness is above this threshold. |
| vars.num.low_categorical_threshold | integer | 5 | If the number of distinct values is smaller than this number, the series is considered categorical. Set to 0 to disable. |
| vars.num.chi_squared_threshold | float | 0.999 | Set to 0 to disable chi-squared calculation. |
| vars.cat.length | boolean | True | Check the string length and aggregate values (min, max, mean, median). |
| vars.cat.characters | boolean | False | Check the distribution of characters and their Unicode properties. Often informative, but may be computationally expensive. |
| vars.cat.words | boolean | False | Check the distribution of words. Often informative, but may be computationally expensive. |
| vars.cat.cardinality_threshold | integer | 50 | Warn if the number of distinct values is above this threshold. |
| vars.cat.imbalance_threshold | float | 0.5 | Warn if the imbalance score is above this threshold. |
| vars.cat.n_obs | integer | 5 | Display this number of observations. |
| vars.cat.chi_squared_threshold | float | 0.999 | Same as vars.num.chi_squared_threshold, but for categorical variables. |
| vars.bool.n_obs | integer | 3 | Same as vars.cat.n_obs, but for boolean variables. |
| vars.bool.imbalance_threshold | float | 0.5 | Warn if the imbalance score is above this threshold. |
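
Two of the thresholds above are easier to reason about with a concrete sketch. The functions below are illustrative and not taken from ydata-profiling: `is_low_cardinality` mirrors the low_categorical_threshold rule as the table words it, and `imbalance_score` assumes the score is one minus the normalized Shannon entropy of the class frequencies, which is a common definition but an assumption here.

```python
import math
from collections import Counter


def is_low_cardinality(values, threshold=5):
    """True when the number of distinct values is below `threshold`.

    With threshold=0 the check can never succeed, which disables it.
    """
    return len(set(values)) < threshold


def imbalance_score(values):
    """Assumed definition: 1 - (Shannon entropy / log2(number of classes)).

    0.0 means perfectly balanced classes; values near 1.0 mean that one
    class dominates. A single-class series is scored 0.0 in this sketch.
    """
    counts = Counter(values)
    n_classes = len(counts)
    if n_classes <= 1:
        return 0.0
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return 1.0 - entropy / math.log2(n_classes)


print(is_low_cardinality([0, 1, 0, 1, 1], threshold=5))  # 2 distinct values
print(is_low_cardinality([0, 1, 0, 1, 1], threshold=0))  # check disabled
print(imbalance_score(["a", "b", "a", "b"]))             # balanced classes
print(imbalance_score(["a"] * 99 + ["b"]))               # heavily skewed
```

Under this assumed definition, the skewed example scores well above the default imbalance_threshold of 0.5, so it would trigger a warning, while the balanced one would not.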
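Configuration example
  import ydata_profiling  # noqa: F401  (importing registers DataFrame.profile_report)

  # `df` is assumed to be an existing pandas DataFrame.
  profile = df.profile_report(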
      sort="ascending",
      vars={
          "num": {"low_categorical_threshold": 0},
          "cat": {
              "length": True,
              "characters": False,
              "words": False,
              "n_obs": 5,
          },
      },
  )

  profile.config.variables.descriptions = {
      "files": "Files in the filesystem",
      "datec": "Creation date",
      "datem": "Modification date",
  }

  profile.to_file("report.html")
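
The dotted parameter names in the tables correspond to the nested dictionaries used in the example above (vars.num.low_categorical_threshold becomes vars={"num": {"low_categorical_threshold": 0}}). A minimal sketch of that mapping follows; `nest_settings` is a hypothetical helper, not a ydata-profiling function.

```python
def nest_settings(flat):
    """Turn dotted setting names into the nested-dict form used as keyword arguments.

    Hypothetical helper for illustration; it assumes no name is both a leaf
    and a parent (e.g. "vars" and "vars.num" in the same mapping).
    """
    nested = {}
    for dotted_key, value in flat.items():
        *parents, leaf = dotted_key.split(".")
        node = nested
        for part in parents:
            # Create intermediate dictionaries on demand.
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


kwargs = nest_settings(
    {
        "sort": "ascending",
        "vars.num.low_categorical_threshold": 0,
        "vars.cat.length": True,
        "vars.cat.n_obs": 5,
    }
)
print(kwargs)
```

The resulting dictionary has the same shape as the keyword arguments in the configuration example, so it could be unpacked as df.profile_report(**kwargs).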

Setting the dataset schema type

Configure the schema type for a given dataset.

Set the variable type schema to generate the profile report
  import pandas as pd

  from ydata_profiling import ProfileReport
  from ydata_profiling.utils.cache import cache_file

  file_name = cache_file(
      "titanic.csv",
      "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv",
  )
  df = pd.read_csv(file_name)

  type_schema = {"Survived": "categorical", "Embarked": "categorical"}

  # We can set the type_schema only for the variables whose types we are certain of.
  # All other variables are inferred automatically.
  report = ProfileReport(df, title="Titanic EDA", type_schema=type_schema)

  report.to_file("report.html")

Missing data overview plots

Settings related to the missing data section and the visualizations it can include.

| Parameter | Type | Default | Description |
|---|---|---|---|
| missing_diagrams.bar | boolean | True | Display a bar chart with counts of missing values for each column. |
| missing_diagrams.matrix | boolean | True | Display a matrix of missing values. Similar to the bar chart, but may also give an overview of the co-occurrence of missing values in rows. |
| missing_diagrams.heatmap | boolean | True | Display a heatmap of missing values that measures nullity correlation (i.e. how strongly the presence or absence of one variable affects the presence of another). |
Configuration example: disable the heatmap for large datasets
  profile = df.profile_report(
      missing_diagrams={
          "heatmap": False,
      }
  )
  profile.to_file("report.html")
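
To make "nullity correlation" concrete, the sketch below computes the Pearson correlation between the missing-value indicators of two columns using only the standard library. It is an illustration of the idea, not the code that draws the heatmap; the function name and the toy columns are made up for this example, and missing values are assumed to be represented by None.

```python
import math


def nullity_correlation(col_a, col_b):
    """Pearson correlation between the "is missing" indicators of two columns."""
    a = [1.0 if v is None else 0.0 for v in col_a]
    b = [1.0 if v is None else 0.0 for v in col_b]
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        # A column that is never (or always) missing has no defined correlation.
        return float("nan")
    return cov / math.sqrt(var_a * var_b)


# Toy columns (hypothetical data).
age = [22, None, 35, None, 41, 19]
cabin = ["C85", None, "E46", None, "B20", "A5"]  # missing exactly when age is missing
fare = [None, 7.9, None, 8.1, None, None]        # present exactly when age is missing

print(nullity_correlation(age, cabin))  # close to +1: values go missing together
print(nullity_correlation(age, fare))   # close to -1: one is missing when the other is present
```

Every pair of columns needs such a computation, which is one reason the heatmap can be worth disabling on wide datasets.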

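Correlations

Settings for the correlation metrics and thresholds. The default is auto, but the following correlation matrices are available: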

| Parameter | Description |
|---|---|
| auto | Calculates the column pairwise correlation depending on the type schema: numerical to numerical variable: Spearman correlation coefficient; categorical to categorical variable: Cramer's V association coefficient; numerical to categorical: Cramer's V association coefficient with the numerical variable discretized automatically. |
| spearman | Spearman's correlation measures the strength and direction of the monotonic association between two variables. Well suited to evaluating the relation between ordinal or continuous variables. |
| pearson | The Pearson correlation coefficient is the most common way of measuring a linear correlation. It is a number between -1 and 1 that measures the strength and direction of the relationship between two variables. |
| kendall | The Kendall rank correlation coefficient is a statistic used to measure the ordinal association between two measured quantities. Kendall's coefficient is often used when the data does not meet one of the requirements of Pearson's correlation. |
| phi_k | Phi K is especially suitable for working with mixed-type variables. Using this coefficient we can find (un)expected correlations and evaluate their statistical significance. |
| cramers | Cramer's V is an association measure commonly used to examine the association between categorical variables when the contingency table is larger than 2x2. |
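
As a rough illustration of what the cramers entry measures, here is the textbook (uncorrected) Cramer's V computed from a contingency table in plain Python. This is a sketch under that assumption; the library may apply a bias correction or other refinements, so values can differ from those in a generated report.

```python
import math
from collections import Counter


def cramers_v(xs, ys):
    """Uncorrected Cramer's V for two equally long sequences of category labels."""
    n = len(xs)
    joint = Counter(zip(xs, ys))   # observed cell counts
    row_totals = Counter(xs)
    col_totals = Counter(ys)
    k = min(len(row_totals), len(col_totals)) - 1
    if n == 0 or k == 0:
        # Undefined when there is no data or a variable has a single level.
        return float("nan")
    chi2 = 0.0
    for r, r_total in row_totals.items():
        for c, c_total in col_totals.items():
            expected = r_total * c_total / n
            observed = joint.get((r, c), 0)
            chi2 += (observed - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * k))


# Perfect association: each x label always appears with the same y label.
print(cramers_v(["a", "b", "c"] * 4, ["x", "y", "z"] * 4))
# No association: every combination occurs equally often.
print(cramers_v(["a", "a", "b", "b"], ["x", "y", "x", "y"]))
```

The result lies between 0 (no association) and 1 (perfect association); unlike Pearson or Spearman it carries no sign.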

For each correlation matrix, the following configuration is available:

| Parameter | Type | Default | Description |
|---|---|---|---|
| correlations.auto.calculate | boolean | True | Whether to compute the 'auto' correlation |
| correlations.auto.warn_high_correlations | boolean | True | Show warning for correlations higher than the threshold |
| correlations.auto.threshold | float | 0.9 | Warning threshold |
| correlations.pearson.calculate | boolean | False | Whether to calculate Pearson correlation |
| correlations.pearson.warn_high_correlations | boolean | True | Show warning for correlations higher than the threshold |
| correlations.pearson.threshold | float | 0.9 | Warning threshold |
| correlations.spearman.calculate | boolean | False | Whether to calculate Spearman correlation |
| correlations.spearman.warn_high_correlations | boolean | False | Show warning for correlations higher than the threshold |
| correlations.spearman.threshold | float | 0.9 | Warning threshold |
| correlations.kendall.calculate | boolean | False | Whether to calculate Kendall rank correlation |
| correlations.kendall.warn_high_correlations | boolean | False | Show warning for correlations higher than the threshold |
| correlations.kendall.threshold | float | 0.9 | Warning threshold |
| correlations.phi_k.calculate | boolean | False | Whether to calculate Phi K correlation |
| correlations.phi_k.warn_high_correlations | boolean | False | Show warning for correlations higher than the threshold |
| correlations.phi_k.threshold | float | 0.9 | Warning threshold |
| correlations.cramers.calculate | boolean | False | Whether to calculate Cramer's V association coefficient |
| correlations.cramers.warn_high_correlations | boolean | True | Show warning for correlations higher than the threshold |
| correlations.cramers.threshold | float | 0.9 | Warning threshold |

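For example, to disable all correlation computations (which may be relevant for large datasets):

Disable all correlation matrices
    profile = df.profile_report(
        title="Report without correlations",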
        correlations={
            "auto": {"calculate": False},
            "pearson": {"calculate": False},
            "spearman": {"calculate": False},
            "kendall": {"calculate": False},
            "phi_k": {"calculate": False},
            "cramers": {"calculate": False},
        },
    )

    # or use the shorthand that is available for correlations
    profile = df.profile_report(
        title="Report without correlations",
        correlations=None,
    )
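
The threshold and warn_high_correlations settings can be pictured as a filter over a computed correlation matrix. The sketch below is illustrative only: the function name and the toy matrix are made up, and comparing the absolute value with a strict "greater than" is an assumption about how the warning is triggered.

```python
def high_correlations(matrix, threshold=0.9):
    """Return (a, b, value) for variable pairs whose |correlation| exceeds threshold.

    `matrix` is a symmetric dict of dicts, e.g. matrix["a"]["b"] == matrix["b"]["a"].
    """
    names = sorted(matrix)
    flagged = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:  # visit each unordered pair once, skip the diagonal
            value = matrix[a][b]
            if abs(value) > threshold:
                flagged.append((a, b, value))
    return flagged


# Toy correlation matrix (hypothetical values).
corr = {
    "age": {"age": 1.0, "fare": 0.10, "fare_usd": 0.12},
    "fare": {"age": 0.10, "fare": 1.0, "fare_usd": 0.98},
    "fare_usd": {"age": 0.12, "fare": 0.98, "fare_usd": 1.0},
}
print(high_correlations(corr, threshold=0.9))
```

With the default threshold of 0.9, only the fare and fare_usd pair is reported in this toy matrix.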

Interactions

Settings related to the interactions section.

| Parameter | Type | Default | Description |
|---|---|---|---|
| interactions.continuous | boolean | True | Generate a 2D scatter plot (or hexagonal binned plot) for all continuous variable pairs. |
| interactions.targets | list | [] | When a list of variable names is given, only interactions between these and all other variables are computed. |
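
The effect of interactions.targets on the amount of work can be sketched by enumerating the variable pairs involved. `interaction_pairs` is a hypothetical helper written for this illustration; whether the library also plots a variable against itself is not stated above, so this sketch skips such pairs.

```python
def interaction_pairs(continuous_variables, targets=()):
    """List the (x, y) pairs for which an interaction plot would be drawn.

    An empty `targets` means every continuous variable is paired with every other.
    """
    xs = list(targets) if targets else list(continuous_variables)
    return [(x, y) for x in xs for y in continuous_variables if x != y]


columns = ["age", "fare", "sibsp", "parch"]  # toy column names

print(len(interaction_pairs(columns)))             # all ordered pairs
print(interaction_pairs(columns, targets=["age"])) # only pairs starting from "age"
```

Restricting targets makes the number of plots grow linearly rather than quadratically with the number of continuous variables.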

Report appearance

Settings related to the appearance and style of the report.

| Parameter | Type | Default | Description |
|---|---|---|---|
| html.minify_html | boolean | True | If True, the output HTML is minified using the htmlmin package. |
| html.use_local_assets | boolean | True | If True, all assets (stylesheets, scripts, images) are stored locally. If False, a CDN is used for some stylesheets and scripts. |
| html.inline | boolean | True | If True, all assets are contained in the report. If False, a web export is created, where all assets are stored in the '[REPORT_NAME]_assets/' directory. |
| html.navbar_show | boolean | True | Whether to include a navigation bar in the report. |
| html.style.theme | string | None | Select a bootswatch theme. Available options: flatly (dark) and united (orange). |
| html.style.logo | string | nan | A base64 encoded logo, to display in the navigation bar. |
| html.style.primary_color | string | #337ab7 | The primary color to use in the report. |
| html.style.full_width | boolean | False | By default, the width of the report is fixed. If set to True, the full width of the screen is used. |
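
Since html.style.logo expects base64 text, the sketch below shows one way to produce it with the standard library and to place it in a nested settings dictionary. `encode_logo` is a hypothetical helper, and the bytes used here are only the 8-byte PNG signature standing in for a real image file. Whether the setting expects the bare base64 string or a full data URI (for example prefixed with "data:image/png;base64,") is not stated above, so check the rendered report.

```python
import base64


def encode_logo(image_bytes: bytes) -> str:
    """Return the base64 text for raw image bytes (hypothetical helper)."""
    return base64.b64encode(image_bytes).decode("ascii")


# Placeholder bytes: the PNG file signature only, not a displayable image.
# In practice these would come from open("logo.png", "rb").read().
fake_png = b"\x89PNG\r\n\x1a\n"
encoded = encode_logo(fake_png)

# Nested form of the html.style.* settings from the table above.
html_settings = {
    "style": {
        "logo": encoded,
        "primary_color": "#337ab7",
        "full_width": True,
    }
}
print(html_settings["style"]["logo"])
```

Following the pattern of the earlier examples, such a dictionary would be passed as the html keyword argument, e.g. df.profile_report(html=html_settings).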