dask_expr._collection.DataFrame.describe

DataFrame.describe(split_every=False, percentiles=None, percentiles_method='default', include=None, exclude=None)

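Generate descriptive statistics.

This docstring was copied from pandas.core.frame.DataFrame.describe.

Some inconsistencies with the Dask version may exist.

Descriptive statistics include those that summarize the central tendency, dispersion and shape of a dataset's distribution, excluding NaN values.

Analyzes both numeric and object Series, as well as DataFrame column sets of mixed data types. The output varies depending on what is provided. Refer to the notes below for more detail.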

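Parameters

percentiles : list-like of numbers, optional

The percentiles to include in the output. All values should fall between 0 and 1. The default is [.25, .5, .75], which returns the 25th, 50th, and 75th percentiles.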
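As a small illustration of the percentiles parameter, the sketch below uses plain pandas (the source of this docstring) and requests the 10th and 90th percentiles. The data is invented for the example.

```python
import pandas as pd

# Ten evenly spaced values, chosen only for illustration.
s = pd.Series(range(1, 11))

# Request the 10th and 90th percentiles instead of the default quartiles.
summary = s.describe(percentiles=[0.1, 0.9])

print(summary)
```

The requested percentiles appear in the result index under the labels 10% and 90%.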

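include : 'all', list-like of dtypes or None (default), optional

A white list of data types to include in the result. Ignored for Series. Here are the options:

'all' : All columns of the input will be included in the output.
A list-like of dtypes : Limits the results to the provided data types. To limit the result to numeric types, submit numpy.number. To limit it instead to object columns, submit the object data type (the Python built-in object; the numpy.object alias was removed in NumPy 1.24). Strings can also be used in the style of select_dtypes (e.g. df.describe(include=['O'])). To select pandas categorical columns, use 'category'.
None (default) : The result will include all numeric columns.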

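exclude : list-like of dtypes or None (default), optional

A black list of data types to omit from the result. Ignored for Series. Here are the options:

A list-like of dtypes : Excludes the provided data types from the result. To exclude numeric types, submit numpy.number. To exclude object columns, submit the object data type (the Python built-in object; the numpy.object alias was removed in NumPy 1.24). Strings can also be used in the style of select_dtypes (e.g. df.describe(exclude=['O'])). To exclude pandas categorical columns, use 'category'.
None (default) : The result will exclude nothing.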

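Returns

Series or DataFrame: Summary statistics of the Series or DataFrame provided.

See also

DataFrame.count: Count the number of non-NA/null observations.
DataFrame.max: Maximum of the values in the object.
DataFrame.min: Minimum of the values in the object.
DataFrame.mean: Mean of the values.
DataFrame.std: Standard deviation of the observations.
DataFrame.select_dtypes: Subset of a DataFrame including/excluding columns based on their dtype.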

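Notes

For numeric data, the result's index will include count, mean, std, min, max as well as lower, 50 and upper percentiles. By default the lower percentile is 25 and the upper percentile is 75. The 50 percentile is the same as the median.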
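The statement that the 50 percentile equals the median can be checked directly. The sketch below uses plain pandas with invented data.

```python
import pandas as pd

# Arbitrary numeric data with an even number of elements.
s = pd.Series([4, 8, 15, 16, 23, 42])

summary = s.describe()

# The "50%" row of describe() and Series.median() report the same value.
median_matches = summary["50%"] == s.median()
print(summary["50%"], s.median(), median_matches)
```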

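For object data (e.g. strings or timestamps), the result's index will include count, unique, top, and freq. The top is the most common value. The freq is the most common value's frequency. Timestamps also include the first and last items.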

If multiple object values share the highest count, the top result is chosen arbitrarily from among them; freq is the same for each of the tied values.
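A short sketch of the tie case in plain pandas, with invented data in which two values occur equally often. Which of the tied values is reported as top should not be relied upon, so the example only checks that it is one of them.

```python
import pandas as pd

# "a" and "b" both occur twice, so they tie for the most common value.
s = pd.Series(["a", "a", "b", "b"], dtype=object)

summary = s.describe()

# top is one of the tied values; freq is their shared count.
print(summary["top"], summary["freq"])
```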

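For mixed data types provided via a DataFrame, the default is to return only an analysis of numeric columns. If the DataFrame consists only of object and categorical data without any numeric columns, the default is to return an analysis of both the object and categorical columns. If include='all' is provided as an option, the result will include a union of attributes of each type.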
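The fallback for a DataFrame without numeric columns can be sketched as follows in plain pandas. The column names and values are hypothetical.

```python
import pandas as pd

# A DataFrame with one object column and one categorical column,
# and no numeric columns at all.
df = pd.DataFrame(
    {
        "label": pd.Series(["x", "y", "x"], dtype=object),
        "grade": pd.Categorical(["low", "high", "low"]),
    }
)

# With no numeric columns present, the default call describes
# the object and categorical columns instead.
summary = df.describe()

print(summary)
```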

The include and exclude parameters can be used to limit which columns in a DataFrame are analyzed for the output. The parameters are ignored when analyzing a Series.
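To illustrate that include has no effect on a Series, the sketch below (plain pandas, invented data) compares a default call with one that passes an include list that would not match a numeric column in a DataFrame.

```python
import pandas as pd

s = pd.Series([1, 2, 3])

# Default description of a numeric Series.
plain = s.describe()

# include is accepted but ignored for a Series, so the numeric
# summary is still returned.
with_include = s.describe(include=[object])

print(plain.equals(with_include))
```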

Examples

Describing a numeric Series.

>>> s = pd.Series([1, 2, 3])  
>>> s.describe()  
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
dtype: float64

Describing a categorical Series.

>>> s = pd.Series(['a', 'a', 'b', 'c'])  
>>> s.describe()  
count     4
unique    3
top       a
freq      2
dtype: object

Describing a timestamp Series.

>>> s = pd.Series([  
...     np.datetime64("2000-01-01"),
...     np.datetime64("2010-01-01"),
...     np.datetime64("2010-01-01")
... ])
>>> s.describe()  
count                      3
mean     2006-09-01 08:00:00
min      2000-01-01 00:00:00
25%      2004-12-31 12:00:00
50%      2010-01-01 00:00:00
75%      2010-01-01 00:00:00
max      2010-01-01 00:00:00
dtype: object

Describing a DataFrame. By default only numeric fields are returned.

>>> df = pd.DataFrame({'categorical': pd.Categorical(['d', 'e', 'f']),  
...                    'numeric': [1, 2, 3],
...                    'object': ['a', 'b', 'c']
...                    })
>>> df.describe()  
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0

Describing all columns of a DataFrame regardless of data type.

>>> df.describe(include='all')  
       categorical  numeric object
count            3      3.0      3
unique           3      NaN      3
top              f      NaN      a
freq             1      NaN      1
mean           NaN      2.0    NaN
std            NaN      1.0    NaN
min            NaN      1.0    NaN
25%            NaN      1.5    NaN
50%            NaN      2.0    NaN
75%            NaN      2.5    NaN
max            NaN      3.0    NaN

Describing a column from a DataFrame by accessing it as an attribute.

>>> df.numeric.describe()  
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
Name: numeric, dtype: float64

Including only numeric columns in a DataFrame description.

>>> df.describe(include=[np.number])  
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0

Including only string columns in a DataFrame description.

>>> df.describe(include=[object])  
       object
count       3
unique      3
top         a
freq        1

Including only categorical columns in a DataFrame description.

>>> df.describe(include=['category'])  
       categorical
count            3
unique           3
top              d
freq             1

Excluding numeric columns from a DataFrame description.

>>> df.describe(exclude=[np.number])  
       categorical object
count            3      3
unique           3      3
top              f      a
freq             1      1

Excluding object columns from a DataFrame description.

>>> df.describe(exclude=[object])  
       categorical  numeric
count            3      3.0
unique           3      NaN
top              f      NaN
freq             1      NaN
mean           NaN      2.0
std            NaN      1.0
min            NaN      1.0
25%            NaN      1.5
50%            NaN      2.0
75%            NaN      2.5
max            NaN      3.0
