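Available settings
A set of options is available to customize the behaviour of ydata-profiling and the appearance of the generated report. This depth of customization makes it possible to tailor the behaviour closely to the specific dataset being analysed. The available settings are listed below. To learn how to change them, see :doc:changing_settings.
General settings
Global report settings: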
| Parameter | Type | Default | Description |
|---|---|---|---|
| title | string | Pandas Profiling Report | Title for the report, shown in the header and title bar. |
| pool_size | integer | 0 | Number of workers in the thread pool. When set to zero, it is set to the number of CPUs available. |
| progress_bar | boolean | True | If True, ydata-profiling will display a progress bar. |
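
As a minimal sketch, the general settings above can be passed as keyword arguments when the report is created. The title and the worker count are illustrative assumptions, and `build_report` is a hypothetical helper; `ProfileReport` is imported inside it so the settings dictionary can be inspected on its own.

```python
import os

# Keyword arguments named after the general settings in the table above.
general_settings = {
    "title": "Sales data profile",  # hypothetical report title
    "pool_size": max(1, (os.cpu_count() or 1) // 2),  # roughly half of the CPUs
    "progress_bar": False,  # handy for non-interactive runs
}


def build_report(df):
    """Create a report for a pandas DataFrame using the settings above."""
    from ydata_profiling import ProfileReport  # requires ydata-profiling

    return ProfileReport(df, **general_settings)
```

Calling `build_report(df).to_file("report.html")` would then write the report to disk, where `df` is any pandas DataFrame.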
Variable summary settings
Settings related to the information displayed for each variable.
| Parameter | Type | Default | Description |
|---|---|---|---|
| sort | None, asc or desc | None | Sort the variables in ascending (asc) or descending (desc) order, or None to keep the original order. |
| variables.descriptions | dict | {} | Display a description alongside the descriptive statistics of each variable ({'var_name': 'Description'}). |
| vars.num.quantiles | list[float] | [0.05,0.25,0.5,0.75,0.95] | The quantiles to calculate. Note that .25, .5 and .75 are required for the computation of other metrics (median and IQR). |
| vars.num.skewness_threshold | integer | 20 | Warn if the skewness is above this threshold. |
| vars.num.low_categorical_threshold | integer | 5 | If the number of distinct values is smaller than this number, the series is considered to be categorical. Set to 0 to disable. |
| vars.num.chi_squared_threshold | float | 0.999 | Set to 0 to disable chi-squared calculation. |
| vars.cat.length | boolean | True | Check the string length and aggregate values (min, max, mean, median). |
| vars.cat.characters | boolean | False | Check the distribution of characters and their Unicode properties. Often informative, but may be computationally expensive. |
| vars.cat.words | boolean | False | Check the distribution of words. Often informative, but may be computationally expensive. |
| vars.cat.cardinality_threshold | integer | 50 | Warn if the number of distinct values is above this threshold. |
| vars.cat.imbalance_threshold | float | 0.5 | Warn if the imbalance score is above this threshold. |
| vars.cat.n_obs | integer | 5 | Display this number of observations. |
| vars.cat.chi_squared_threshold | float | 0.999 | Same as vars.num.chi_squared_threshold, but for categorical variables. |
| vars.bool.n_obs | integer | 3 | Same as vars.cat.n_obs, but for boolean variables. |
| vars.bool.imbalance_threshold | float | 0.5 | Warn if the imbalance score is above this threshold. |
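
The dotted names in the table correspond to nested dictionaries when passed as keyword arguments. The sketch below illustrates that mapping; the column name, its description and the chosen values are assumptions made for illustration, and `build_report` is a hypothetical helper.

```python
# A dotted name such as vars.num.quantiles becomes {"vars": {"num": {"quantiles": ...}}}.
variable_settings = {
    "sort": "ascending",
    "variables": {
        "descriptions": {
            "age": "Age of the customer in years",  # hypothetical column
        }
    },
    "vars": {
        "num": {
            # Keep 0.25, 0.5 and 0.75: the median and IQR depend on them.
            "quantiles": [0.1, 0.25, 0.5, 0.75, 0.9],
            "low_categorical_threshold": 0,  # 0 disables this check
        },
        "cat": {"words": True, "cardinality_threshold": 100},
    },
}


def build_report(df):
    """Create a report for a pandas DataFrame using the settings above."""
    from ydata_profiling import ProfileReport  # requires ydata-profiling

    return ProfileReport(df, **variable_settings)
```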
Setting dataset schema type
Configure the schema type for a given dataset.
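
A schema can be sketched as a dictionary that maps column names to the type they should be treated as, passed through the `type_schema` argument. The column names and the report title below are assumptions for illustration, and `build_report` is a hypothetical helper.

```python
# Map each column to the type it should be profiled as.
# "Survived" and "Embarked" are hypothetical column names.
type_schema = {"Survived": "categorical", "Embarked": "categorical"}


def build_report(df):
    """Create a report that treats the listed columns as categorical."""
    from ydata_profiling import ProfileReport  # requires ydata-profiling

    return ProfileReport(df, title="Titanic EDA", type_schema=type_schema)
```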
Missing data overview plots
Settings related to the missing data section and the visualizations it can include.
| Parameter | Type | Default | Description |
|---|---|---|---|
| missing_diagrams.bar | boolean | True | Display a bar chart with counts of missing values for each column. |
| missing_diagrams.matrix | boolean | True | Display a matrix of missing values. Similar to the bar chart, but may also give an overview of the co-occurrence of missing values in rows. |
| missing_diagrams.heatmap | boolean | True | Display a heatmap of missing values that measures nullity correlation (i.e. how strongly the presence or absence of one variable affects the presence of another). |
Configuration example: disabling the heatmap for large datasets.
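
A minimal sketch of such a configuration, assuming the settings are passed as a nested dictionary under the `missing_diagrams` keyword; `build_report` is a hypothetical helper.

```python
# Keep the cheaper diagrams and switch off only the heatmap.
missing_settings = {
    "missing_diagrams": {
        "bar": True,
        "matrix": True,
        "heatmap": False,
    }
}


def build_report(df):
    """Create a report for a pandas DataFrame without the missing-values heatmap."""
    from ydata_profiling import ProfileReport  # requires ydata-profiling

    return ProfileReport(df, **missing_settings)
```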
Correlations
Settings for the correlation metrics and thresholds.
The default is auto, but the following correlation matrices are available:
| Parameter | Description |
|---|---|
| auto | Calculates the pairwise column correlation depending on the type schema: numerical to numerical uses the Spearman correlation coefficient; categorical to categorical uses Cramer's V association coefficient; numerical to categorical uses Cramer's V association coefficient with the numerical variable discretized automatically. |
| spearman | Spearman's correlation measures the strength and direction of the monotonic association between two variables. Well suited to evaluating the strength of the relationship between ordinal or continuous variables. |
| pearson | The Pearson correlation coefficient is the most common way of measuring a linear correlation. It is a number between -1 and 1 that measures the strength and direction of the relationship between two variables. |
| kendall | The Kendall rank correlation coefficient is a statistic used to measure the ordinal association between two measured quantities. Kendall's tau is often used when the data does not meet one of the requirements of Pearson's correlation. |
| phi_k | Phi K is especially suitable for working with mixed-type variables. With this coefficient, (un)expected correlations can be found and their statistical significance evaluated. |
| cramers | Cramer's V is an association measure commonly used to examine the association between categorical variables when the contingency table is larger than 2x2. |
For each correlation matrix, the following configuration options are available:
| Parameter | Type | Default | Description |
|---|---|---|---|
| correlations.auto.calculate | boolean | True | Whether to compute the 'auto' correlation |
| correlations.auto.warn_high_correlations | boolean | True | Show a warning for correlations higher than the threshold |
| correlations.auto.threshold | float | 0.9 | Warning threshold |
| correlations.pearson.calculate | boolean | False | Whether to calculate the Pearson correlation |
| correlations.pearson.warn_high_correlations | boolean | True | Show a warning for correlations higher than the threshold |
| correlations.pearson.threshold | float | 0.9 | Warning threshold |
| correlations.spearman.calculate | boolean | False | Whether to calculate the Spearman correlation |
| correlations.spearman.warn_high_correlations | boolean | False | Show a warning for correlations higher than the threshold |
| correlations.spearman.threshold | float | 0.9 | Warning threshold |
| correlations.kendall.calculate | boolean | False | Whether to calculate the Kendall rank correlation |
| correlations.kendall.warn_high_correlations | boolean | False | Show a warning for correlations higher than the threshold |
| correlations.kendall.threshold | float | 0.9 | Warning threshold |
| correlations.phi_k.calculate | boolean | False | Whether to calculate the Phi K correlation |
| correlations.phi_k.warn_high_correlations | boolean | False | Show a warning for correlations higher than the threshold |
| correlations.phi_k.threshold | float | 0.9 | Warning threshold |
| correlations.cramers.calculate | boolean | False | Whether to calculate Cramer's V association coefficient |
| correlations.cramers.warn_high_correlations | boolean | True | Show a warning for correlations higher than the threshold |
| correlations.cramers.threshold | float | 0.9 | Warning threshold |
For example, to disable all correlation computations (which may be relevant for large datasets):
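
One way to sketch this is to set `calculate` to `False` for every correlation matrix listed in the table. The report title is an illustrative assumption, and `build_report` is a hypothetical helper.

```python
# The correlation matrices named in the table above.
CORRELATION_NAMES = ["auto", "pearson", "spearman", "kendall", "phi_k", "cramers"]

# Build {"auto": {"calculate": False}, "pearson": {"calculate": False}, ...}.
correlation_settings = {name: {"calculate": False} for name in CORRELATION_NAMES}


def build_report(df):
    """Create a report for a pandas DataFrame with no correlations computed."""
    from ydata_profiling import ProfileReport  # requires ydata-profiling

    return ProfileReport(
        df,
        title="Report without correlations",
        correlations=correlation_settings,
    )
```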
Interactions
Settings related to the interactions section.
| Parameter | Type | Default | Description |
|---|---|---|---|
| interactions.continuous | boolean | True | Generate a 2D scatter plot (or hexagonal binned plot) for all continuous variable pairs. |
| interactions.targets | list | [] | When a list of variable names is given, only interactions between these and all other variables are computed. |
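
As a sketch, restricting the interactions to a few target variables can keep this section manageable on wide datasets. The column names below are assumptions for illustration, and `build_report` is a hypothetical helper.

```python
# Only compute interactions that involve the listed target columns.
interaction_settings = {
    "continuous": True,
    "targets": ["price", "area"],  # hypothetical column names
}


def build_report(df):
    """Create a report for a pandas DataFrame with interactions limited to the targets."""
    from ydata_profiling import ProfileReport  # requires ydata-profiling

    return ProfileReport(df, interactions=interaction_settings)
```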
Appearance of the report
Settings related to the appearance and style of the report.
| Parameter | Type | Default | Description |
|---|---|---|---|
| html.minify_html | boolean | True | If True, the output HTML is minified using the htmlmin package. |
| html.use_local_assets | boolean | True | If True, all assets (stylesheets, scripts, images) are stored locally. If False, a CDN is used for some stylesheets and scripts. |
| html.inline | boolean | True | If True, all assets are contained in the report. If False, a web export is created, where all assets are stored in the '[REPORT_NAME]_assets/' directory. |
| html.navbar_show | boolean | True | Whether to include a navigation bar in the report |
| html.style.theme | string | None | Select a bootswatch theme. Available options: flatly (dark) and united (orange) |
| html.style.logo | string | | A base64 encoded logo, to display in the navigation bar. |
| html.style.primary_color | string | #337ab7 | The primary color to use in the report. |
| html.style.full_width | boolean | False | By default, the width of the report is fixed. If set to True, the full width of the screen is used. |
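
The appearance settings nest in the same way, with the style options one level deeper under `html.style`. The sketch below uses illustrative values, and `build_report` is a hypothetical helper.

```python
# html.style.full_width becomes {"html": {"style": {"full_width": ...}}}.
html_settings = {
    "minify_html": False,  # keep the HTML readable while debugging
    "navbar_show": True,
    "style": {"full_width": True},
}


def build_report(df):
    """Create a full-width report for a pandas DataFrame."""
    from ydata_profiling import ProfileReport  # requires ydata-profiling

    return ProfileReport(df, html=html_settings)
```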