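pandas docstring guide#

About docstrings and standards#

A Python docstring is a string used to document a Python module, class, function or method, so programmers can understand what it does without having to read the details of the implementation.

It is also common practice to generate online (HTML) documentation automatically from docstrings. Sphinx serves this purpose.

The next example gives an idea of what a docstring looks like: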

def add(num1, num2):
    """
    Add up two integer numbers.

    This function simply wraps the ``+`` operator, and does not
    do anything interesting, except for illustrating what
    the docstring of a very simple function looks like.

    Parameters
    ----------
    num1 : int
        First number to add.
    num2 : int
        Second number to add.

    Returns
    -------
    int
        The sum of ``num1`` and ``num2``.

    See Also
    --------
    subtract : Subtract one integer from another.

    Examples
    --------
    >>> add(2, 2)
    4
    >>> add(25, 0)
    25
    >>> add(10, -10)
    0
    """
    return num1 + num2

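Some standards exist for docstrings. They make docstrings easier to read and allow them to be exported to other formats such as HTML or PDF.

The first conventions every Python docstring should follow are defined in PEP-257.

As PEP-257 is quite broad, other more specific standards also exist. pandas follows the NumPy docstring convention, which is explained in this document:

numpydoc docstring guide

numpydoc is a Sphinx extension that supports the NumPy docstring convention.

The standard uses reStructuredText (reST), a markup language that allows encoding styles in plain text files. Documentation about reStructuredText can be found in:

Sphinx reStructuredText primer
Quick reStructuredText reference
Full reStructuredText specification

pandas has some helpers for sharing docstrings between related classes, see Sharing docstrings.

The rest of this document summarizes all the above guidelines and provides additional conventions specific to the pandas project.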

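Writing a docstring#

General rules#

Docstrings must be defined with three double quotes. No blank lines should be left before or after the docstring. The text starts on the line after the opening quotes. The closing quotes have their own line (meaning that they are not at the end of the last sentence).

On rare occasions reST styles like bold text or italics will be used in docstrings, but it is common to have inline code, which is presented between backticks. The following are considered inline code:

The name of a parameter
Python code, a module, function, built-in, type, literal… (e.g. os, list, numpy.abs, datetime.date, True)
A pandas class (in the form :class:`pandas.Series`)
A pandas method (in the form :meth:`pandas.Series.sum`)
A pandas function (in the form :func:`pandas.to_datetime`)

Note

To display only the last component of the linked class, method or function, prefix it with ~. For example, :class:`~pandas.Series` will link to pandas.Series but only display the last part, Series, as the link text. See Sphinx cross-referencing syntax for details.

Good: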
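
As a rough illustration of the ~ prefix described in the note, the hypothetical helper below mimics how the visible link text relates to a cross-reference target. It is not how Sphinx is implemented; `link_text` is a name invented for this sketch.

```python
def link_text(target):
    """Return the text a cross-reference target would display (sketch)."""
    if target.startswith("~"):
        # A leading tilde keeps the full link target but shows only
        # the last dotted component as the link text.
        return target[1:].rsplit(".", 1)[-1]
    # Without the tilde, the full dotted name is displayed.
    return target


print(link_text("~pandas.Series"))
print(link_text("pandas.Series.sum"))
```

The first call prints only the class name, while the second prints the full dotted path.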

def add_values(arr):
    """
    Add the values in ``arr``.

    This is equivalent to Python ``sum`` or :meth:`pandas.Series.sum`.

    Some sections are omitted here for simplicity.
    """
    return sum(arr)

Bad:

def func():

    """Some function.

    With several mistakes in the docstring.

    It has a blank line after the signature ``def func():``.

    The text 'Some function' should go in the line after the
    opening quotes of the docstring, not in the same line.

    There is a blank line between the docstring and the first line
    of code ``foo = 1``.

    The closing quotes should be in the next line, not in this one."""

    foo = 1
    bar = 2
    return foo + bar

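Section 1: short summary#

The short summary is a single sentence that expresses what the function does in a concise way.

The short summary must start with a capital letter, end with a period, and fit on a single line. It needs to express what the object does without providing details. For functions and methods, the short summary must start with an infinitive verb.

Good: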

def astype(dtype):
    """
    Cast Series type.

    This section will provide further details.
    """
    pass

Bad:

def astype(dtype):
    """
    Casts Series type.

    Verb in third-person of the present simple, should be infinitive.
    """
    pass

def astype(dtype):
    """
    Method to cast Series type.

    Does not start with verb.
    """
    pass

def astype(dtype):
    """
    Cast Series type

    Missing dot at the end.
    """
    pass

def astype(dtype):
    """
    Cast Series type from its current type to the new type defined in
    the parameter dtype.

    Summary is too verbose and doesn't fit in a single line.
    """
    pass
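
The mechanical parts of the short summary rules can be checked automatically. The sketch below is a hypothetical helper (`check_summary` is not part of pandas) that covers the capital letter, the final period and the single-line requirement. It does not try to detect whether the first word is an infinitive verb.

```python
def check_summary(docstring):
    """Return a list of problems found in the short summary (sketch)."""
    lines = docstring.strip().splitlines()
    if not lines:
        return ["is empty"]
    summary = lines[0].strip()
    problems = []
    if not summary[0].isupper():
        problems.append("does not start with a capital letter")
    if not summary.endswith("."):
        problems.append("does not end with a period")
    # The summary must fit on one line, so the next line has to be blank.
    if len(lines) > 1 and lines[1].strip():
        problems.append("spans more than one line")
    return problems


good = """
    Cast Series type.

    This section will provide further details.
    """
print(check_summary(good))
```

An empty list means none of the checked rules is violated.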

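Section 2: extended summary#

The extended summary provides details on what the function does. It should not go into the details of the parameters or discuss implementation notes, which belong in other sections.

A blank line is left between the short summary and the extended summary. Every paragraph in the extended summary ends with a period.

The extended summary should explain why the function is useful and describe its use cases, if it is not too generic.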

def unstack():
    """
    Pivot a row index to columns.

    When using a MultiIndex, a level can be pivoted so each value in
    the index becomes a column. This is especially useful when a subindex
    is repeated for the main index, and data is easier to visualize as a
    pivot table.

    The index level will be automatically removed from the index when added
    as columns.
    """
    pass

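Section 3: parameters#

The details of the parameters are added in this section. The section has the title "Parameters", followed by a line with a hyphen under each letter of the word "Parameters". A blank line is left before the section title, but not after it, and not between the line with the word "Parameters" and the line of hyphens.

After the title, each parameter in the signature must be documented, including *args and **kwargs, but not self.

The parameters are defined by their name, followed by a space, a colon, another space, and the type (or types). Note that the space between the name and the colon is important. Types are not defined for *args and **kwargs, but must be defined for all other parameters. After the parameter definition, a line with the description is required; it is indented and can span multiple lines. The description must start with a capital letter and finish with a period.

For keyword arguments with a default value, the default is listed after a comma at the end of the type. The exact form of the type in this case is "int, default 0". In some cases it may be useful to explain what the default argument means, which can be added after a comma: "int, default -1, meaning all cpus".

In cases where the default value is None, meaning that the value will not be used, prefer "str, optional" over "str, default None". When None is a value being used, keep the form "str, default None". For example, in df.to_csv(compression=None), None is not a value being used; it means that compression is optional and no compression is applied if it is not provided. In this case use "str, optional". Only in cases like func(value=None), where None is used in the same way as 0 or foo would be, specify "str, int or None, default None".

Good: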
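
The difference between "optional" and "default None" is easier to see side by side. The function below is a hypothetical example written for this guide (it is not a pandas API): `prefix` is simply skipped when it is None, while `fill_value` uses None as an actual replacement value.

```python
def describe(values, sep=", ", prefix=None, fill_value=None):
    """
    Join values into a single string.

    Parameters
    ----------
    values : list of str
        Values to join.
    sep : str, default ', '
        Separator placed between the values.
    prefix : str, optional
        Text placed before the joined values. If not provided, no
        prefix is added.
    fill_value : str, int or None, default None
        Value that replaces empty strings. ``None`` is used like any
        other value, so by default empty strings become ``None``.

    Returns
    -------
    str
        The joined values.
    """
    # None is used as a real value here, exactly like 0 or 'foo' would be.
    filled = [fill_value if value == "" else value for value in values]
    joined = sep.join(str(value) for value in filled)
    # None only means "not provided" here, so nothing is added.
    return joined if prefix is None else prefix + joined


print(describe(["a", "", "c"]))
print(describe(["a", ""], fill_value=0))
```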

class Series:
    def plot(self, kind, color='blue', **kwargs):
        """
        Generate a plot.

        Render the data in the Series as a matplotlib plot of the
        specified kind.

        Parameters
        ----------
        kind : str
            Kind of matplotlib plot.
        color : str, default 'blue'
            Color name or rgb code.
        **kwargs
            These parameters will be passed to the matplotlib plotting
            function.
        """
        pass

Bad:

class Series:
    def plot(self, kind, **kwargs):
        """
        Generate a plot.

        Render the data in the Series as a matplotlib plot of the
        specified kind.

        Note the blank line between the parameters title and the first
        parameter. Also, note that after the name of the parameter ``kind``
        and before the colon, a space is missing.

        Also, note that the parameter descriptions do not start with a
        capital letter, and do not finish with a dot.

        Finally, the ``**kwargs`` parameter is missing.

        Parameters
        ----------

        kind: str
            kind of matplotlib plot
        """
        pass
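
One of the formatting rules above, a line of hyphens exactly as long as the section title, is easy to get wrong and easy to check. The following is a minimal sketch of such a check, assuming the docstring is available as a plain string; `check_section_underlines` is a hypothetical helper and not part of pandas or numpydoc.

```python
def check_section_underlines(docstring):
    """Return the section titles whose underline length is wrong (sketch)."""
    lines = [line.strip() for line in docstring.splitlines()]
    bad_titles = []
    for title, underline in zip(lines, lines[1:]):
        # An underline is a non-empty line made only of hyphens.
        is_underline = bool(underline) and set(underline) == {"-"}
        if is_underline and len(underline) != len(title):
            bad_titles.append(title)
    return bad_titles


docstring = """
    Generate a plot.

    Parameters
    ----------
    kind : str
        Kind of matplotlib plot.

    Returns
    ------
    None
    """
print(check_section_underlines(docstring))
```

Here "Returns" is reported because its underline has six hyphens instead of seven.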

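Parameter types#

When specifying the parameter types, Python built-in data types can be used directly (the Python type is preferred to the more verbose string, integer, boolean, etc.):

int
float
str
bool

For complex types, define the subtypes. For dict and tuple, as more than one type is present, brackets help read the type (curly brackets for dict and normal brackets for tuple):

list of int
dict of {str : int}
tuple of (str, int, int)
tuple of (str,)
set of str

If only a fixed set of values is allowed, list them in curly brackets, separated by commas (followed by a space). If the values are ordinal, list them in that order. Otherwise, list the default value first, if there is one:

{0, 10, 25}
{'simple', 'advanced'}
{'low', 'medium', 'high'}
{'cat', 'dog', 'bird'}

If the type is defined in a Python module, the module must be specified:

datetime.date
datetime.datetime
decimal.Decimal

If the type is in a package, the module must also be specified:

numpy.ndarray
scipy.sparse.coo_matrix

If the type is a pandas type, also specify pandas, except for Series and DataFrame:

Series
DataFrame
pandas.Index
pandas.Categorical
pandas.arrays.SparseArray

If the exact type is not relevant but must be compatible with a NumPy array, array-like can be specified. If any type that can be iterated is accepted, iterable can be used:

array-like
iterable

If more than one type is accepted, separate them with commas, except the last two types, which are separated by the word "or":

int or float
float, decimal.Decimal or None
str or list of str

If None is one of the accepted values, it always needs to be the last in the list.

For axis, the convention is to use something like:

axis : {0 or 'index', 1 or 'columns', None}, default None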
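
To show several of these type notations together, here is a small self-contained sketch. The function `summarize` and its parameters are invented for illustration and only use the standard library; the point is the way the types are written in the docstring.

```python
import statistics


def summarize(data, keys=None, how="mean", ndigits=None):
    """
    Summarize the numeric values stored under each key.

    Parameters
    ----------
    data : dict of {str : list of float}
        Values to summarize, grouped by name.
    keys : str or list of str, optional
        Names to include. If not provided, all names are included.
    how : {'mean', 'median'}, default 'mean'
        Statistic to compute.
    ndigits : int, optional
        Number of decimals to round to. If not provided, results are
        not rounded.

    Returns
    -------
    dict of {str : float}
        Computed statistic for each selected name.
    """
    if keys is None:
        keys = list(data)
    elif isinstance(keys, str):
        keys = [keys]
    func = {"mean": statistics.mean, "median": statistics.median}[how]
    result = {key: func(data[key]) for key in keys}
    if ndigits is not None:
        result = {key: round(value, ndigits) for key, value in result.items()}
    return result


speeds = {"falcon": [380.0, 370.0], "parrot": [24.0, 26.0]}
print(summarize(speeds))
```

Note that the default value 'mean' is listed first among the allowed values of `how`, and that `keys` and `ndigits` use "optional" because None only means they were not provided.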

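Section 4: returns or yields#

If the method returns a value, it is documented in this section. The same applies if the method yields its output.

The title of the section is defined in the same way as "Parameters": the name "Returns" or "Yields", followed by a line with as many hyphens as the letters in the preceding word.

The documentation of the return is also similar to the parameters, but in this case no name is provided, unless the method returns or yields more than one value (a tuple of values).

The types for "Returns" and "Yields" are the same as the ones for "Parameters". Also, the description must finish with a period.

For example, with a single value: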

def sample():
    """
    Generate and return a random number.

    The value is sampled from a continuous uniform distribution between
    0 and 1.

    Returns
    -------
    float
        Random number generated.
    """
    return np.random.random()

With more than one value:

import string

def random_letters():
    """
    Generate and return a sequence of random letters.

    The length of the returned string is also random, and is also
    returned.

    Returns
    -------
    length : int
        Length of the returned string.
    letters : str
        String of random letters.
    """
    length = np.random.randint(1, 10)
    letters = ''.join(np.random.choice(list(string.ascii_lowercase))
                      for i in range(length))
    return length, letters

If the method yields its value:

def sample_values():
    """
    Generate an infinite sequence of random numbers.

    The values are sampled from a continuous uniform distribution between
    0 and 1.

    Yields
    ------
    float
        Random number generated.
    """
    while True:
        yield np.random.random()
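
Whether a function needs "Returns" or "Yields" follows directly from whether it is a generator. The sketch below uses the standard library `inspect` module to make that decision and to check that the matching section is present; `expected_section`, `documents_output_correctly` and `count_up` are hypothetical names used only for this illustration.

```python
import inspect


def expected_section(func):
    """Return the docstring section name that documents the output."""
    # Generator functions produce values with ``yield``, so their output
    # is documented under "Yields"; everything else uses "Returns".
    return "Yields" if inspect.isgeneratorfunction(func) else "Returns"


def documents_output_correctly(func):
    """Check that the docstring has the expected section and underline."""
    section = expected_section(func)
    doc = inspect.getdoc(func) or ""
    lines = [line.strip() for line in doc.splitlines()]
    return any(
        title == section and underline == "-" * len(section)
        for title, underline in zip(lines, lines[1:])
    )


def count_up(limit):
    """
    Generate consecutive integers starting at zero.

    Parameters
    ----------
    limit : int
        Number of integers to generate.

    Yields
    ------
    int
        Next integer in the sequence.
    """
    for value in range(limit):
        yield value


print(expected_section(count_up), documents_output_correctly(count_up))
```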

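Section 5: see also#

This section is used to let users know about pandas functionality related to the one being documented. In rare cases, if no related methods or functions can be found at all, this section can be skipped.

An obvious example is the head() and tail() methods. As tail() does the equivalent of head() but at the end of the Series or DataFrame instead of at the beginning, it is good to let users know about it.

To give an intuition on what can be considered related, here are some examples:

loc and iloc, as they do the same thing, but in one case providing indices and in the other positions
max and min, as they do the opposite
iterrows, itertuples and items, as it is easy for a user looking for the method to iterate over columns to end up in the method to iterate over rows, and vice versa
fillna and dropna, as both methods are used to handle missing values
read_csv and to_csv, as they are complementary
merge and join, as one is a generalization of the other
astype and pandas.to_datetime, as users may be reading the documentation of astype to learn how to cast to a date, and the way to do it is with pandas.to_datetime
where is related to numpy.where, as its functionality is based on it

When deciding what is related, you should mainly use your common sense and think about what can be useful for the users reading the documentation, especially the less experienced ones.

When relating to other libraries (mainly numpy), use the name of the module first (not an alias like np). If the function is in a module which is not the main one, like scipy.sparse, list the full module (e.g. scipy.sparse.coo_matrix).

This section has a header, "See Also" (note the capital S and A), followed by the line of hyphens and preceded by a blank line.

After the header, add a line for each related method or function, followed by a space, a colon, another space, and a short description that illustrates what this method or function does, why it is relevant in this context, and what the key differences are between the documented function and the one being referenced. The description must also end with a period.

Note that in "Returns" and "Yields", the description is located on the line after the type. In this section, however, it is located on the same line, with a colon in between. If the description does not fit on the same line, it can continue onto other lines, which must be further indented.

For example: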

class Series:
    def head(self):
        """
        Return the first 5 elements of the Series.

        This function is mainly useful to preview the values of the
        Series without displaying the whole of it.

        Returns
        -------
        Series
            Subset of the original series with the 5 first values.

        See Also
        --------
        Series.tail : Return the last 5 elements of the Series.
        Series.iloc : Return a slice of the elements in the Series,
            which can also be used to return the first or last n.
        """
        return self.iloc[:5]

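Section 6: notes#

This is an optional section used for notes about the implementation of the algorithm, or to document technical aspects of the function behavior.

Feel free to skip it, unless you are familiar with the implementation of the algorithm, or you discover some counter-intuitive behavior while writing the examples for the function.

This section follows the same format as the extended summary section.

Section 7: examples#

This is one of the most important sections of a docstring, despite being placed in the last position, as people often understand concepts better by example than through accurate explanations.

Examples in docstrings, besides illustrating the usage of the function or method, must be valid Python code that returns the given output in a deterministic way and that can be copied and run by users.

Examples are presented as a session in the Python terminal. >>> is used to present code. ... is used for code continuing from the previous line. Output is presented immediately after the last line of code generating the output (no blank lines in between). Comments describing the examples can be added with blank lines before and after them.

The way to present examples is as follows:

Import required libraries (except numpy and pandas)
Create the data required for the example
Show a very basic example that gives an idea of the most common use case
Add examples with explanations that illustrate how the parameters can be used for extended functionality

A simple example could be: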
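
Since the guide gives no example of a "Notes" section, here is one possible sketch. The function is hypothetical and uses only the standard library; the note records an implementation detail that would be out of place in the extended summary.

```python
import math


def mean(values):
    """
    Compute the arithmetic mean of the values.

    Parameters
    ----------
    values : list of float
        Values to average.

    Returns
    -------
    float
        Arithmetic mean of ``values``.

    Notes
    -----
    The sum is computed with :func:`math.fsum`, which tracks intermediate
    partial sums to avoid the loss of precision of a naive running sum.
    """
    return math.fsum(values) / len(values)


print(mean([1.0, 2.0, 3.0]))
```

As with the extended summary, the notes are plain paragraphs that end with a period.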

class Series:

    def head(self, n=5):
        """
        Return the first elements of the Series.

        This function is mainly useful to preview the values of the
        Series without displaying all of it.

        Parameters
        ----------
        n : int, default 5
            Number of values to return.

        Returns
        -------
        Series
            Subset of the original series with the n first values.

        See Also
        --------
        tail : Return the last n elements of the Series.

        Examples
        --------
        >>> ser = pd.Series(['Ant', 'Bear', 'Cow', 'Dog', 'Falcon',
        ...                  'Lion', 'Monkey', 'Rabbit', 'Zebra'])
        >>> ser.head()
        0       Ant
        1      Bear
        2       Cow
        3       Dog
        4    Falcon
        dtype: object

        With the ``n`` parameter, we can change the number of returned rows:

        >>> ser.head(n=3)
        0     Ant
        1    Bear
        2     Cow
        dtype: object
        """
        return self.iloc[:n]

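The examples should be as concise as possible. In cases where the complexity of the function requires long examples, it is recommended to use blocks with headers in bold. Use double star ** to make text bold, like in **this example**.

Conventions for the examples#

Code in examples is assumed to always start with these two lines, which are not shown:

import numpy as np
import pandas as pd

Any other module used in the examples must be explicitly imported, one per line (as recommended in PEP 8#imports), avoiding aliases. Avoid excessive imports, but if needed, imports from the standard library go first, followed by third-party libraries (like matplotlib).

When illustrating examples with a single Series use the name ser, and if illustrating with a single DataFrame use the name df. For indices, idx is the preferred name. If a set of homogeneous Series or DataFrame is used, name them ser1, ser2, ser3… or df1, df2, df3… If the data is not homogeneous and more than one structure is needed, give them meaningful names, for example df_main and df_to_join.

Data used in the example should be as compact as possible. The number of rows is recommended to be around 4, but make it a number that makes sense for the specific example. For example, in the head method it needs to be higher than 5 in order to show the example with the default values. When computing the mean, data like [1, 2, 3] makes it easy to see that the value returned is the mean.

For more complex examples (grouping for example), avoid using data without interpretation, like a matrix of random numbers with columns A, B, C, D… Instead use a meaningful example, which makes it easier to understand the concept. Unless required by the example, use names of animals and their numerical properties, to keep examples consistent.

When calling the method, keyword arguments head(n=3) are preferred to positional arguments head(3).

Good: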
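
The import and keyword-argument conventions can be seen in a docstring that needs a module other than numpy or pandas. The function below is a hypothetical, standard-library-only sketch: the example imports `datetime` explicitly and without an alias, gives the data meaningful names, and calls the function with keyword arguments.

```python
import datetime


def days_between(start, end):
    """
    Compute the number of days between two dates.

    Parameters
    ----------
    start : datetime.date
        First date.
    end : datetime.date
        Second date.

    Returns
    -------
    int
        Number of days from ``start`` to ``end``.

    Examples
    --------
    >>> import datetime
    >>> start = datetime.date(2024, 1, 1)
    >>> end = datetime.date(2024, 1, 31)
    >>> days_between(start=start, end=end)
    30
    """
    return (end - start).days


print(days_between(start=datetime.date(2024, 1, 1),
                   end=datetime.date(2024, 1, 31)))
```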

class Series:

    def mean(self):
        """
        Compute the mean of the input.

        Examples
        --------
        >>> ser = pd.Series([1, 2, 3])
        >>> ser.mean()
        2.0
        """
        pass


    def fillna(self, value):
        """
        Replace missing values by ``value``.

        Examples
        --------
        >>> ser = pd.Series([1, np.nan, 3])
        >>> ser.fillna(0)
        0    1.0
        1    0.0
        2    3.0
        dtype: float64
        """
        pass

    def groupby_mean(self):
        """
        Group by index and return mean.

        Examples
        --------
        >>> ser = pd.Series([380., 370., 24., 26],
        ...               name='max_speed',
        ...               index=['falcon', 'falcon', 'parrot', 'parrot'])
        >>> ser.groupby_mean()
        index
        falcon    375.0
        parrot     25.0
        Name: max_speed, dtype: float64
        """
        pass

    def contains(self, pattern, case_sensitive=True, na=numpy.nan):
        """
        Return whether each value contains ``pattern``.

        In this case, we are illustrating how to use sections, even
        if the example is simple enough and does not require them.

        Examples
        --------
        >>> ser = pd.Series(['Antelope', 'Lion', 'Zebra', np.nan])
        >>> ser.contains(pattern='a')
        0    False
        1    False
        2     True
        3      NaN
        dtype: object

        **Case sensitivity**

        With ``case_sensitive`` set to ``False`` we can match ``a`` with both
        ``a`` and ``A``:

        >>> ser.contains(pattern='a', case_sensitive=False)
        0     True
        1    False
        2     True
        3      NaN
        dtype: object

        **Missing values**

        We can fill missing values in the output using the ``na`` parameter:

        >>> ser.contains(pattern='a', na=False)
        0    False
        1    False
        2     True
        3    False
        dtype: bool
        """
        pass

Bad:

def method(foo=None, bar=None):
    """
    A sample DataFrame method.

    Do not import NumPy and pandas.

    Try to use meaningful data, when it makes the example easier
    to understand.

    Try to avoid positional arguments like in ``df.method(1)``. They
    can be all right if previously defined with a meaningful name,
    like in ``present_value(interest_rate)``, but avoid them otherwise.

    When presenting the behavior with different parameters, do not place
    all the calls one next to the other. Instead, add a short sentence
    explaining what the example shows.

    Examples
    --------
    >>> import numpy as np
    >>> import pandas as pd
    >>> df = pd.DataFrame(np.random.randn(3, 3),
    ...                   columns=('a', 'b', 'c'))
    >>> df.method(1)
    21
    >>> df.method(bar=14)
    123
    """
    pass

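Tips for getting your examples to pass the doctests#

Getting the examples to pass the doctests in the validation script can sometimes be tricky. Here are some points to watch:

Import all needed libraries (except for pandas and NumPy, which are already imported as import pandas as pd and import numpy as np) and define all variables you use in the example.
Try to avoid using random data. However, random data might be OK in some cases, for example if the function you are documenting deals with probability distributions, or if the amount of data needed to make the function result meaningful is too large to create manually without great effort. In these cases, always use a fixed random seed to make the generated examples predictable. Example: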
```
>>> np.random.seed(42)
>>> df = pd.DataFrame({'normal': np.random.normal(100, 5, 20)})
```

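If you have a code snippet that wraps multiple lines, you need to use '...' on the continued lines:

>>> df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], index=['a', 'b'],
...                   columns=['A', 'B', 'C'])

If you want to show a case where an exception is raised, you can do: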
```
>>> pd.to_datetime(["712-01-01"])
Traceback (most recent call last):
OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 712-01-01 00:00:00
```
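It is essential to include the "Traceback (most recent call last):", but for the actual error only the error name is sufficient.
If there is a small part of the result that can vary (e.g. a hash in an object representation), you can use ... to represent this part.

If you want to show that s.plot() returns a matplotlib AxesSubplot object, this will fail the doctest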
```
>>> s.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x7efd0c0b0690>
```
However, you can do (notice the comment that needs to be added)
```
>>> s.plot()  # doctest: +SKIP
<matplotlib.axes._subplots.AxesSubplot at ...>
```
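
To try these tips without pandas installed, the following sketch runs docstring examples with the standard library `doctest` module. The helper `run_examples`, the class `Point` and both example functions are hypothetical and exist only for this illustration. The first docstring uses the `+ELLIPSIS` directive for the part of the representation that varies, and the second shows an expected exception.

```python
import doctest


class Point:
    """Minimal class whose default repr contains a memory address."""


def make_point():
    """
    Create a new point.

    Examples
    --------
    >>> make_point()  # doctest: +ELLIPSIS
    <...Point object at 0x...>
    """
    return Point()


def parse_count(text):
    """
    Convert text to an integer count.

    Examples
    --------
    >>> parse_count("3")
    3

    Invalid text raises an error:

    >>> parse_count("three")
    Traceback (most recent call last):
    ValueError: invalid literal for int() with base 10: 'three'
    """
    return int(text)


def run_examples(func):
    """Run the examples in the docstring of ``func`` with doctest."""
    test = doctest.DocTestParser().get_doctest(
        func.__doc__, {func.__name__: func}, func.__name__, None, 0
    )
    # The result reports how many examples were attempted and failed.
    return doctest.DocTestRunner(verbose=False).run(test)


for func in (make_point, parse_count):
    print(func.__name__, run_examples(func))
```

A result with zero failures means every example produced exactly the output written in the docstring.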

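Plots in examples#

There are some methods in pandas returning plots. To render the plots generated by the examples in the documentation, the .. plot:: directive exists.

To use it, place the next code after the "Examples" header as shown below. The plot will be generated automatically when building the documentation.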

class Series:
    def plot(self):
        """
        Generate a plot with the ``Series`` data.

        Examples
        --------

        .. plot::
            :context: close-figs

            >>> ser = pd.Series([1, 2, 3])
            >>> ser.plot()
        """
        pass
