.. _10min_tut_06_stats:

{{ header }}

.. ipython:: python

    import pandas as pd

Data used for this tutorial:

How to calculate summary statistics
-----------------------------------

Aggregating statistics
~~~~~~~~~~~~~~~~~~~~~~

.. image:: ../../_static/schemas/06_aggregate.svg
   :align: center

Different statistics are available and can be applied to columns with
numerical data. Operations in general exclude missing data and operate
across rows by default.

.. image:: ../../_static/schemas/06_reduction.svg
   :align: center

The aggregating statistic can be calculated for multiple columns at the
same time. Remember the ``describe`` function from the
:ref:`first tutorial <10min_tut_01_tableoriented>`?

.. ipython:: python

    titanic[["Age", "Fare"]].describe()

Instead of the predefined statistics, specific combinations of
aggregating statistics for given columns can be defined using the
:func:`DataFrame.agg` method:

.. ipython:: python

    titanic.agg(
        {
            "Age": ["min", "max", "median", "skew"],
            "Fare": ["min", "max", "median", "mean"],
        }
    )

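
The behaviour described above can be tried on a small table. The following is a minimal sketch, assuming a hypothetical ``passengers`` DataFrame with made-up values standing in for the ``titanic`` data; it shows a single-column statistic, a multi-column statistic, the handling of missing data, and a per-column ``agg`` call.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the tutorial's ``titanic`` table (values are made up).
passengers = pd.DataFrame(
    {
        "Age": [22.0, 38.0, np.nan, 35.0],
        "Fare": [7.25, 71.25, 8.0, 53.5],
    }
)

# A single statistic on one column; the missing Age is skipped by default.
mean_age = passengers["Age"].mean()

# The same kind of statistic on several columns returns one value per column.
medians = passengers[["Age", "Fare"]].median()

# skipna=False propagates missing values instead of ignoring them.
mean_age_strict = passengers["Age"].mean(skipna=False)

# axis=1 aggregates within each row instead of down each column.
row_means = passengers.mean(axis=1)

# Different statistics per column; statistics not requested for a column are NaN.
summary = passengers.agg({"Age": ["min", "max"], "Fare": ["min", "mean"]})

print(mean_age, mean_age_strict)
print(medians)
print(summary)
```

Because ``agg`` aligns the requested statistics in one table, a statistic asked for only one column (here ``max`` for ``Age``) shows up as ``NaN`` in the other.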

Details about descriptive statistics are provided in the user guide section on :ref:`descriptive statistics `.


Aggregating statistics grouped by category
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. image:: ../../_static/schemas/06_groupby.svg
   :align: center

Calculating a given statistic (e.g. ``mean`` age) *for each category in
a column* (e.g. male/female in the ``Sex`` column) is a common pattern.
The ``groupby`` method supports this type of operation. It fits in
the more general ``split-apply-combine`` pattern:

-  **Split** the data into groups
-  **Apply** a function to each group independently
-  **Combine** the results into a data structure

The apply and combine steps are typically done together in pandas.

In the previous example, we explicitly selected the 2 columns first. If
no columns are selected, the ``mean`` method can be applied to every
column containing numerical data by passing ``numeric_only=True``:

.. ipython:: python

    titanic.groupby("Sex").mean(numeric_only=True)

It does not make much sense to get the average value of ``Pclass``.
If we are only interested in the average age for each gender, the
selection of columns (square brackets ``[]`` as usual) is supported
on the grouped data as well:

.. ipython:: python

    titanic.groupby("Sex")["Age"].mean()

.. image:: ../../_static/schemas/06_groupby_select_detail.svg
   :align: center

.. note::
    The ``Pclass`` column contains numerical data but actually
    represents 3 categories (or factors) with the labels ‘1’, ‘2’ and
    ‘3’, respectively. Calculating statistics on these does not make
    much sense. Therefore, pandas provides a ``Categorical`` data type
    to handle this type of data. More information is provided in the
    user guide :ref:`categorical` section.

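
To make the split-apply-combine steps concrete, the sketch below spells them out by hand and compares the result with ``groupby``. It is an illustration only: the ``passengers`` table and its values are made up, and the manual version is not how pandas implements ``groupby`` internally.

```python
import pandas as pd

# Hypothetical stand-in for the ``titanic`` table (values are made up).
passengers = pd.DataFrame(
    {
        "Sex": ["male", "female", "female", "male"],
        "Age": [22.0, 38.0, 26.0, 34.0],
    }
)

# Split: one sub-table per category in the Sex column.
groups = {
    sex: passengers[passengers["Sex"] == sex]
    for sex in passengers["Sex"].unique()
}

# Apply: compute the mean age of each sub-table independently.
means = {sex: group["Age"].mean() for sex, group in groups.items()}

# Combine: collect the per-group results into one Series, sorted by label.
manual = pd.Series(means, name="Age").rename_axis("Sex").sort_index()

# groupby performs the same three steps in a single expression.
grouped = passengers.groupby("Sex")["Age"].mean()

print(manual)
print(grouped)
```

Both Series hold one mean age per category; the ``groupby`` form is shorter and avoids building the intermediate sub-tables explicitly.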

A full description of the split-apply-combine approach is provided in the user guide section on :ref:`groupby operations `.


Count number of records by category
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. image:: ../../_static/schemas/06_valuecounts.svg
   :align: center

The ``value_counts`` method is a shortcut: it is actually a groupby
operation combined with counting the number of records within each
group:

.. ipython:: python

    titanic.groupby("Pclass")["Pclass"].count()

.. note::
    Both ``size`` and ``count`` can be used in combination with
    ``groupby``. Whereas ``size`` includes ``NaN`` values and just
    provides the number of rows (size of the table), ``count`` excludes
    the missing values. In the ``value_counts`` method, use the
    ``dropna`` argument to include or exclude the ``NaN`` values.

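
The differences between ``value_counts``, ``count`` and ``size`` are easiest to see when some values are missing. The following is a small sketch on a hypothetical ``passengers`` table with made-up values, not the tutorial's ``titanic`` data.

```python
import numpy as np
import pandas as pd

# Hypothetical data (values are made up); one Age and one Cabin entry are missing.
passengers = pd.DataFrame(
    {
        "Pclass": [3, 1, 3, 2, 3],
        "Cabin": ["C85", np.nan, "E46", "C85", "C85"],
        "Age": [22.0, 38.0, np.nan, 35.0, 54.0],
    }
)

# value_counts orders the result by frequency, most common category first.
per_class = passengers["Pclass"].value_counts()

# The groupby version counts the same records but orders them by group label.
per_class_grouped = passengers.groupby("Pclass")["Pclass"].count()

# size counts rows per group; count only counts non-missing values.
rows_per_class = passengers.groupby("Pclass")["Age"].size()
ages_per_class = passengers.groupby("Pclass")["Age"].count()

# dropna=False keeps missing values as their own category in value_counts.
cabins_with_missing = passengers["Cabin"].value_counts(dropna=False)
cabins_without_missing = passengers["Cabin"].value_counts()

print(per_class)
print(rows_per_class)
print(ages_per_class)
print(cabins_with_missing)
```

In this table, class 3 has three rows but only two known ages, so ``size`` and ``count`` disagree for that group.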

The user guide has a dedicated section on ``value_counts``; see the page on :ref:`discretization `.


REMEMBER

-  Aggregation statistics can be calculated on entire columns or rows.
-  ``groupby`` provides the power of the *split-apply-combine* pattern.
-  ``value_counts`` is a convenient shortcut to count the number of
   entries in each category of a variable.


A full description of the split-apply-combine approach is provided in the user guide pages about :ref:`groupby operations `.