MultiIndex / advanced indexing#

This section covers indexing with a MultiIndex and other advanced indexing features.

See Indexing and selecting data for general indexing documentation.

See the cookbook for some advanced strategies.

Hierarchical indexing (MultiIndex)#

Hierarchical / multi-level indexing is very exciting as it opens the door to some quite sophisticated data analysis and manipulation, especially for working with higher dimensional data. In essence, it enables you to store and manipulate data with an arbitrary number of dimensions in lower dimensional data structures like ``Series`` (1d) and ``DataFrame`` (2d).

In this section, we will show what exactly we mean by "hierarchical" indexing and how it integrates with all of the pandas indexing functionality described above and in prior sections. Later, when discussing group by and pivoting and reshaping data, we'll show non-trivial applications to illustrate how it aids in structuring data for analysis.

See the cookbook for some advanced strategies.

Creating a MultiIndex (hierarchical index) object#

The MultiIndex object is the hierarchical analogue of the standard Index object which typically stores the axis labels in pandas objects. You can think of MultiIndex as an array of tuples where each tuple is unique. A MultiIndex can be created from a list of arrays (using MultiIndex.from_arrays()), an array of tuples (using MultiIndex.from_tuples()), a crossed set of iterables (using MultiIndex.from_product()), or a DataFrame (using MultiIndex.from_frame()). The Index constructor will attempt to return a MultiIndex when it is passed a list of tuples. The following examples demonstrate different ways to initialize MultiIndexes.

In [1]: arrays = [
   ...:     ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
   ...:     ["one", "two", "one", "two", "one", "two", "one", "two"],
   ...: ]
   ...: 

In [2]: tuples = list(zip(*arrays))

In [3]: tuples
Out[3]: 
[('bar', 'one'),
 ('bar', 'two'),
 ('baz', 'one'),
 ('baz', 'two'),
 ('foo', 'one'),
 ('foo', 'two'),
 ('qux', 'one'),
 ('qux', 'two')]

In [4]: index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])

In [5]: index
Out[5]: 
MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('baz', 'one'),
            ('baz', 'two'),
            ('foo', 'one'),
            ('foo', 'two'),
            ('qux', 'one'),
            ('qux', 'two')],
           names=['first', 'second'])

In [6]: s = pd.Series(np.random.randn(8), index=index)

In [7]: s
Out[7]: 
first  second
bar    one       0.469112
       two      -0.282863
baz    one      -1.509059
       two      -1.135632
foo    one       1.212112
       two      -0.173215
qux    one       0.119209
       two      -1.044236
dtype: float64

When you want every pairing of the elements in two iterables, it can be easier to use the MultiIndex.from_product() method:

In [8]: iterables = [["bar", "baz", "foo", "qux"], ["one", "two"]]

In [9]: pd.MultiIndex.from_product(iterables, names=["first", "second"])
Out[9]: 
MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('baz', 'one'),
            ('baz', 'two'),
            ('foo', 'one'),
            ('foo', 'two'),
            ('qux', 'one'),
            ('qux', 'two')],
           names=['first', 'second'])

You can also construct a MultiIndex from a DataFrame directly, using the method MultiIndex.from_frame(). This is a complementary method to MultiIndex.to_frame().

In [10]: df = pd.DataFrame(
   ....:     [["bar", "one"], ["bar", "two"], ["foo", "one"], ["foo", "two"]],
   ....:     columns=["first", "second"],
   ....: )
   ....: 

In [11]: pd.MultiIndex.from_frame(df)
Out[11]: 
MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('foo', 'one'),
            ('foo', 'two')],
           names=['first', 'second'])
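To make the round trip concrete, here is a small sketch of MultiIndex.to_frame(): it converts the index back into a DataFrame whose columns are named after the levels.

```python
import pandas as pd

# Build a small MultiIndex, then convert it back to a DataFrame.
mi = pd.MultiIndex.from_tuples(
    [("bar", "one"), ("bar", "two")], names=["first", "second"]
)
frame = mi.to_frame(index=False)  # columns come from the level names
print(frame)
```

Passing index=False keeps the level values as ordinary columns rather than also repeating them in the resulting frame's index.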

As a convenience, you can pass a list of arrays directly into Series or DataFrame to construct a MultiIndex automatically:

In [12]: arrays = [
   ....:     np.array(["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"]),
   ....:     np.array(["one", "two", "one", "two", "one", "two", "one", "two"]),
   ....: ]
   ....: 

In [13]: s = pd.Series(np.random.randn(8), index=arrays)

In [14]: s
Out[14]: 
bar  one   -0.861849
     two   -2.104569
baz  one   -0.494929
     two    1.071804
foo  one    0.721555
     two   -0.706771
qux  one   -1.039575
     two    0.271860
dtype: float64

In [15]: df = pd.DataFrame(np.random.randn(8, 4), index=arrays)

In [16]: df
Out[16]: 
                0         1         2         3
bar one -0.424972  0.567020  0.276232 -1.087401
    two -0.673690  0.113648 -1.478427  0.524988
baz one  0.404705  0.577046 -1.715002 -1.039268
    two -0.370647 -1.157892 -1.344312  0.844885
foo one  1.075770 -0.109050  1.643563 -1.469388
    two  0.357021 -0.674600 -1.776904 -0.968914
qux one -1.294524  0.413738  0.276662 -0.472035
    two -0.013960 -0.362543 -0.006154 -0.923061

All of the MultiIndex constructors accept a names argument which stores string names for the levels themselves. If no names are provided, None will be assigned:

In [17]: df.index.names
Out[17]: FrozenList([None, None])

This index can back any axis of a pandas object, and the number of levels of the index is up to you:

In [18]: df = pd.DataFrame(np.random.randn(3, 8), index=["A", "B", "C"], columns=index)

In [19]: df
Out[19]: 
first        bar                 baz                 foo                 qux          
second       one       two       one       two       one       two       one       two
A       0.895717  0.805244 -1.206412  2.565646  1.431256  1.340309 -1.170299 -0.226169
B       0.410835  0.813850  0.132003 -0.827317 -0.076467 -1.187678  1.130127 -1.436737
C      -1.413681  1.607920  1.024180  0.569605  0.875906 -2.211372  0.974466 -2.006747

In [20]: pd.DataFrame(np.random.randn(6, 6), index=index[:6], columns=index[:6])
Out[20]: 
first              bar                 baz                 foo          
second             one       two       one       two       one       two
first second                                                            
bar   one    -0.410001 -0.078638  0.545952 -1.219217 -1.226825  0.769804
      two    -1.281247 -0.727707 -0.121306 -0.097883  0.695775  0.341734
baz   one     0.959726 -1.110336 -0.619976  0.149748 -0.732339  0.687738
      two     0.176444  0.403310 -0.154951  0.301624 -2.179861 -1.369849
foo   one    -0.954208  1.462696 -1.743161 -0.826591 -0.345352  1.314232
      two     0.690579  0.995761  2.396780  0.014871  3.357427 -0.317441

We've "sparsified" the higher levels of the indexes to make the console output a bit easier on the eyes. Note that how the index is displayed can be controlled using the multi_sparse option in pandas.set_options():

In [21]: with pd.option_context("display.multi_sparse", False):
   ....:     df
   ....: 

It's worth keeping in mind that there's nothing preventing you from using tuples as atomic labels on an axis:

In [22]: pd.Series(np.random.randn(8), index=tuples)
Out[22]: 
(bar, one)   -1.236269
(bar, two)    0.896171
(baz, one)   -0.487602
(baz, two)   -0.082240
(foo, one)   -2.182937
(foo, two)    0.380396
(qux, one)    0.084844
(qux, two)    0.432390
dtype: float64

The reason that the MultiIndex matters is that it can allow you to do grouping, selection, and reshaping operations as we will describe below and in subsequent areas of the documentation. As you will see in later sections, you can find yourself working with hierarchically-indexed data without creating a MultiIndex explicitly yourself. However, when loading data from a file, you may wish to generate your own MultiIndex when preparing the data set.
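For instance, a common way to generate such an index from flat data is DataFrame.set_index() with several columns. A minimal sketch, with made-up data standing in for a file load:

```python
import pandas as pd

# Flat data, as it might arrive from read_csv().
df = pd.DataFrame(
    {
        "first": ["bar", "bar", "baz", "baz"],
        "second": ["one", "two", "one", "two"],
        "value": [1.0, 2.0, 3.0, 4.0],
    }
)

# Promote the two label columns to a hierarchical row index.
indexed = df.set_index(["first", "second"])
print(indexed.index.names)
```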

Reconstructing the level labels#

The method get_level_values() will return a vector of the labels for each location at a particular level:

In [23]: index.get_level_values(0)
Out[23]: Index(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'], dtype='object', name='first')

In [24]: index.get_level_values("second")
Out[24]: Index(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'], dtype='object', name='second')

Basic indexing on axis with MultiIndex#

One of the important features of hierarchical indexing is that you can select data by a "partial" label identifying a subgroup in the data. **Partial** selection "drops" levels of the hierarchical index in the result in a completely analogous way to selecting a column in a regular DataFrame:

In [25]: df["bar"]
Out[25]: 
second       one       two
A       0.895717  0.805244
B       0.410835  0.813850
C      -1.413681  1.607920

In [26]: df["bar", "one"]
Out[26]: 
A    0.895717
B    0.410835
C   -1.413681
Name: (bar, one), dtype: float64

In [27]: df["bar"]["one"]
Out[27]: 
A    0.895717
B    0.410835
C   -1.413681
Name: one, dtype: float64

In [28]: s["qux"]
Out[28]: 
one   -1.039575
two    0.271860
dtype: float64

See Cross-section with hierarchical index for how to select on a deeper level.

Defined levels#

The MultiIndex keeps all the defined levels of an index, even if they are not actually used. You may notice this when slicing the index. For example:

In [29]: df.columns.levels  # original MultiIndex
Out[29]: FrozenList([['bar', 'baz', 'foo', 'qux'], ['one', 'two']])

In [30]: df[["foo", "qux"]].columns.levels  # sliced
Out[30]: FrozenList([['bar', 'baz', 'foo', 'qux'], ['one', 'two']])

This is done to avoid a recomputation of the levels in order to make slicing highly performant. If you want to see only the used levels, you can use the get_level_values() method.

In [31]: df[["foo", "qux"]].columns.to_numpy()
Out[31]: 
array([('foo', 'one'), ('foo', 'two'), ('qux', 'one'), ('qux', 'two')],
      dtype=object)

# for a specific level
In [32]: df[["foo", "qux"]].columns.get_level_values(0)
Out[32]: Index(['foo', 'foo', 'qux', 'qux'], dtype='object', name='first')

To reconstruct the MultiIndex with only the used levels, the remove_unused_levels() method may be used.

In [33]: new_mi = df[["foo", "qux"]].columns.remove_unused_levels()

In [34]: new_mi.levels
Out[34]: FrozenList([['foo', 'qux'], ['one', 'two']])

Data alignment and using reindex#

Operations between differently-indexed objects having MultiIndex on the axes will work as you expect; data alignment will work the same as an Index of tuples:

In [35]: s + s[:-2]
Out[35]: 
bar  one   -1.723698
     two   -4.209138
baz  one   -0.989859
     two    2.143608
foo  one    1.443110
     two   -1.413542
qux  one         NaN
     two         NaN
dtype: float64

In [36]: s + s[::2]
Out[36]: 
bar  one   -1.723698
     two         NaN
baz  one   -0.989859
     two         NaN
foo  one    1.443110
     two         NaN
qux  one   -2.079150
     two         NaN
dtype: float64

The reindex() method of Series/DataFrames can be called with another MultiIndex, or even a list or array of tuples:

In [37]: s.reindex(index[:3])
Out[37]: 
first  second
bar    one      -0.861849
       two      -2.104569
baz    one      -0.494929
dtype: float64

In [38]: s.reindex([("foo", "two"), ("bar", "one"), ("qux", "one"), ("baz", "one")])
Out[38]: 
foo  two   -0.706771
bar  one   -0.861849
qux  one   -1.039575
baz  one   -0.494929
dtype: float64

Advanced indexing with hierarchical index#

Syntactically integrating MultiIndex in advanced indexing with .loc is a bit challenging, but we've made every effort to do so. In general, MultiIndex keys take the form of tuples. For example, the following works as you would expect:

In [39]: df = df.T

In [40]: df
Out[40]: 
                     A         B         C
first second                              
bar   one     0.895717  0.410835 -1.413681
      two     0.805244  0.813850  1.607920
baz   one    -1.206412  0.132003  1.024180
      two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
      two     1.340309 -1.187678 -2.211372
qux   one    -1.170299  1.130127  0.974466
      two    -0.226169 -1.436737 -2.006747

In [41]: df.loc[("bar", "two")]
Out[41]: 
A    0.805244
B    0.813850
C    1.607920
Name: (bar, two), dtype: float64

Note that df.loc['bar', 'two'] would also work in this example, but this shorthand notation can lead to ambiguity in general.

If you also want to index a specific column with .loc, you must use a tuple like this:

In [42]: df.loc[("bar", "two"), "A"]
Out[42]: 0.8052440253863785

You don't have to specify all levels of the MultiIndex by passing only the first elements of the tuple. For example, you can use "partial" indexing to get all elements with bar in the first level as follows:

In [43]: df.loc["bar"]
Out[43]: 
               A         B         C
second                              
one     0.895717  0.410835 -1.413681
two     0.805244  0.813850  1.607920

This is a shortcut for the slightly more verbose notation df.loc[('bar',),] (equivalent to df.loc['bar',] in this example).

"Partial" slicing also works quite nicely.

In [44]: df.loc["baz":"foo"]
Out[44]: 
                     A         B         C
first second                              
baz   one    -1.206412  0.132003  1.024180
      two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
      two     1.340309 -1.187678 -2.211372

You can slice with a 'range' of values, by providing a slice of tuples.

In [45]: df.loc[("baz", "two"):("qux", "one")]
Out[45]: 
                     A         B         C
first second                              
baz   two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
      two     1.340309 -1.187678 -2.211372
qux   one    -1.170299  1.130127  0.974466

In [46]: df.loc[("baz", "two"):"foo"]
Out[46]: 
                     A         B         C
first second                              
baz   two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
      two     1.340309 -1.187678 -2.211372

Passing a list of labels or tuples works similarly to reindexing:

In [47]: df.loc[[("bar", "two"), ("qux", "one")]]
Out[47]: 
                     A         B         C
first second                              
bar   two     0.805244  0.813850  1.607920
qux   one    -1.170299  1.130127  0.974466

Note

It is important to note that tuples and lists are not treated identically in pandas when it comes to indexing. Whereas a tuple is interpreted as one multi-level key, a list is used to specify several keys. Or in other words, tuples go horizontally (traversing levels), lists go vertically (scanning levels).

Importantly, a list of tuples indexes several complete MultiIndex keys, whereas a tuple of lists refers to several values within a level:

In [48]: s = pd.Series(
   ....:     [1, 2, 3, 4, 5, 6],
   ....:     index=pd.MultiIndex.from_product([["A", "B"], ["c", "d", "e"]]),
   ....: )
   ....: 

In [49]: s.loc[[("A", "c"), ("B", "d")]]  # list of tuples
Out[49]: 
A  c    1
B  d    5
dtype: int64

In [50]: s.loc[(["A", "B"], ["c", "d"])]  # tuple of lists
Out[50]: 
A  c    1
   d    2
B  c    4
   d    5
dtype: int64

Using slicers#

You can slice a MultiIndex by providing multiple indexers.

You can provide any of the selectors as if you are indexing by label, see Selection by label, including slices, lists of labels, labels, and boolean indexers.

You can use slice(None) to select all the contents of that level. You do not need to specify all the deeper levels; they will be implied as slice(None).

As usual, both sides of the slicers are included as this is label indexing.

Warning

You should specify all axes in the .loc specifier, meaning the indexer for the index and for the columns. There are some ambiguous cases where the passed indexer could be mis-interpreted as indexing both axes, rather than into, say, the MultiIndex of rows.

You should do this:

df.loc[(slice("A1", "A3"), ...), :]  # noqa: E999

You should not do this:

df.loc[(slice("A1", "A3"), ...)]  # noqa: E999

In [51]: def mklbl(prefix, n):
   ....:     return ["%s%s" % (prefix, i) for i in range(n)]
   ....: 

In [52]: miindex = pd.MultiIndex.from_product(
   ....:     [mklbl("A", 4), mklbl("B", 2), mklbl("C", 4), mklbl("D", 2)]
   ....: )
   ....: 

In [53]: micolumns = pd.MultiIndex.from_tuples(
   ....:     [("a", "foo"), ("a", "bar"), ("b", "foo"), ("b", "bah")], names=["lvl0", "lvl1"]
   ....: )
   ....: 

In [54]: dfmi = (
   ....:     pd.DataFrame(
   ....:         np.arange(len(miindex) * len(micolumns)).reshape(
   ....:             (len(miindex), len(micolumns))
   ....:         ),
   ....:         index=miindex,
   ....:         columns=micolumns,
   ....:     )
   ....:     .sort_index()
   ....:     .sort_index(axis=1)
   ....: )
   ....: 

In [55]: dfmi
Out[55]: 
lvl0           a         b     
lvl1         bar  foo  bah  foo
A0 B0 C0 D0    1    0    3    2
         D1    5    4    7    6
      C1 D0    9    8   11   10
         D1   13   12   15   14
      C2 D0   17   16   19   18
...          ...  ...  ...  ...
A3 B1 C1 D1  237  236  239  238
      C2 D0  241  240  243  242
         D1  245  244  247  246
      C3 D0  249  248  251  250
         D1  253  252  255  254

[64 rows x 4 columns]

Basic MultiIndex slicing using slices, lists, and labels.

In [56]: dfmi.loc[(slice("A1", "A3"), slice(None), ["C1", "C3"]), :]
Out[56]: 
lvl0           a         b     
lvl1         bar  foo  bah  foo
A1 B0 C1 D0   73   72   75   74
         D1   77   76   79   78
      C3 D0   89   88   91   90
         D1   93   92   95   94
   B1 C1 D0  105  104  107  106
...          ...  ...  ...  ...
A3 B0 C3 D1  221  220  223  222
   B1 C1 D0  233  232  235  234
         D1  237  236  239  238
      C3 D0  249  248  251  250
         D1  253  252  255  254

[24 rows x 4 columns]

You can use pandas.IndexSlice to facilitate a more natural syntax using :, rather than using slice(None).

In [57]: idx = pd.IndexSlice

In [58]: dfmi.loc[idx[:, :, ["C1", "C3"]], idx[:, "foo"]]
Out[58]: 
lvl0           a    b
lvl1         foo  foo
A0 B0 C1 D0    8   10
         D1   12   14
      C3 D0   24   26
         D1   28   30
   B1 C1 D0   40   42
...          ...  ...
A3 B0 C3 D1  220  222
   B1 C1 D0  232  234
         D1  236  238
      C3 D0  248  250
         D1  252  254

[32 rows x 2 columns]

It is possible to perform quite complicated selections using this method on multiple axes at the same time.

In [59]: dfmi.loc["A1", (slice(None), "foo")]
Out[59]: 
lvl0        a    b
lvl1      foo  foo
B0 C0 D0   64   66
      D1   68   70
   C1 D0   72   74
      D1   76   78
   C2 D0   80   82
...       ...  ...
B1 C1 D1  108  110
   C2 D0  112  114
      D1  116  118
   C3 D0  120  122
      D1  124  126

[16 rows x 2 columns]

In [60]: dfmi.loc[idx[:, :, ["C1", "C3"]], idx[:, "foo"]]
Out[60]: 
lvl0           a    b
lvl1         foo  foo
A0 B0 C1 D0    8   10
         D1   12   14
      C3 D0   24   26
         D1   28   30
   B1 C1 D0   40   42
...          ...  ...
A3 B0 C3 D1  220  222
   B1 C1 D0  232  234
         D1  236  238
      C3 D0  248  250
         D1  252  254

[32 rows x 2 columns]

Using a boolean indexer, you can provide selection related to the *values*.

In [61]: mask = dfmi[("a", "foo")] > 200

In [62]: dfmi.loc[idx[mask, :, ["C1", "C3"]], idx[:, "foo"]]
Out[62]: 
lvl0           a    b
lvl1         foo  foo
A3 B0 C1 D1  204  206
      C3 D0  216  218
         D1  220  222
   B1 C1 D0  232  234
         D1  236  238
      C3 D0  248  250
         D1  252  254

You can also specify the axis argument to .loc to interpret the passed slicers on a single axis.

In [63]: dfmi.loc(axis=0)[:, :, ["C1", "C3"]]
Out[63]: 
lvl0           a         b     
lvl1         bar  foo  bah  foo
A0 B0 C1 D0    9    8   11   10
         D1   13   12   15   14
      C3 D0   25   24   27   26
         D1   29   28   31   30
   B1 C1 D0   41   40   43   42
...          ...  ...  ...  ...
A3 B0 C3 D1  221  220  223  222
   B1 C1 D0  233  232  235  234
         D1  237  236  239  238
      C3 D0  249  248  251  250
         D1  253  252  255  254

[32 rows x 4 columns]

Furthermore, you can set the values using the following methods.

In [64]: df2 = dfmi.copy()

In [65]: df2.loc(axis=0)[:, :, ["C1", "C3"]] = -10

In [66]: df2
Out[66]: 
lvl0           a         b     
lvl1         bar  foo  bah  foo
A0 B0 C0 D0    1    0    3    2
         D1    5    4    7    6
      C1 D0  -10  -10  -10  -10
         D1  -10  -10  -10  -10
      C2 D0   17   16   19   18
...          ...  ...  ...  ...
A3 B1 C1 D1  -10  -10  -10  -10
      C2 D0  241  240  243  242
         D1  245  244  247  246
      C3 D0  -10  -10  -10  -10
         D1  -10  -10  -10  -10

[64 rows x 4 columns]

You can use a right-hand-side of an alignable object as well.

In [67]: df2 = dfmi.copy()

In [68]: df2.loc[idx[:, :, ["C1", "C3"]], :] = df2 * 1000

In [69]: df2
Out[69]: 
lvl0              a               b        
lvl1            bar     foo     bah     foo
A0 B0 C0 D0       1       0       3       2
         D1       5       4       7       6
      C1 D0    9000    8000   11000   10000
         D1   13000   12000   15000   14000
      C2 D0      17      16      19      18
...             ...     ...     ...     ...
A3 B1 C1 D1  237000  236000  239000  238000
      C2 D0     241     240     243     242
         D1     245     244     247     246
      C3 D0  249000  248000  251000  250000
         D1  253000  252000  255000  254000

[64 rows x 4 columns]

Cross-section#

The xs() method of DataFrame additionally takes a level argument to make selecting data at a particular level of a MultiIndex easier.

In [70]: df
Out[70]: 
                     A         B         C
first second                              
bar   one     0.895717  0.410835 -1.413681
      two     0.805244  0.813850  1.607920
baz   one    -1.206412  0.132003  1.024180
      two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
      two     1.340309 -1.187678 -2.211372
qux   one    -1.170299  1.130127  0.974466
      two    -0.226169 -1.436737 -2.006747

In [71]: df.xs("one", level="second")
Out[71]: 
              A         B         C
first                              
bar    0.895717  0.410835 -1.413681
baz   -1.206412  0.132003  1.024180
foo    1.431256 -0.076467  0.875906
qux   -1.170299  1.130127  0.974466
# using the slicers
In [72]: df.loc[(slice(None), "one"), :]
Out[72]: 
                     A         B         C
first second                              
bar   one     0.895717  0.410835 -1.413681
baz   one    -1.206412  0.132003  1.024180
foo   one     1.431256 -0.076467  0.875906
qux   one    -1.170299  1.130127  0.974466

You can also select on the columns with xs, by providing the axis argument.

In [73]: df = df.T

In [74]: df.xs("one", level="second", axis=1)
Out[74]: 
first       bar       baz       foo       qux
A      0.895717 -1.206412  1.431256 -1.170299
B      0.410835  0.132003 -0.076467  1.130127
C     -1.413681  1.024180  0.875906  0.974466
# using the slicers
In [75]: df.loc[:, (slice(None), "one")]
Out[75]: 
first        bar       baz       foo       qux
second       one       one       one       one
A       0.895717 -1.206412  1.431256 -1.170299
B       0.410835  0.132003 -0.076467  1.130127
C      -1.413681  1.024180  0.875906  0.974466

xs also allows selection with multiple keys.

In [76]: df.xs(("one", "bar"), level=("second", "first"), axis=1)
Out[76]: 
first        bar
second       one
A       0.895717
B       0.410835
C      -1.413681
# using the slicers
In [77]: df.loc[:, ("bar", "one")]
Out[77]: 
A    0.895717
B    0.410835
C   -1.413681
Name: (bar, one), dtype: float64

You can pass drop_level=False to xs to retain the level that was selected.

In [78]: df.xs("one", level="second", axis=1, drop_level=False)
Out[78]: 
first        bar       baz       foo       qux
second       one       one       one       one
A       0.895717 -1.206412  1.431256 -1.170299
B       0.410835  0.132003 -0.076467  1.130127
C      -1.413681  1.024180  0.875906  0.974466

Compare the above with the result using ``drop_level=True`` (the default value).

In [79]: df.xs("one", level="second", axis=1, drop_level=True)
Out[79]: 
first       bar       baz       foo       qux
A      0.895717 -1.206412  1.431256 -1.170299
B      0.410835  0.132003 -0.076467  1.130127
C     -1.413681  1.024180  0.875906  0.974466

Advanced reindexing and alignment#

Using the parameter level in the reindex() and align() methods of pandas objects is useful to broadcast values across a level. For instance:

In [80]: midx = pd.MultiIndex(
   ....:     levels=[["zero", "one"], ["x", "y"]], codes=[[1, 1, 0, 0], [1, 0, 1, 0]]
   ....: )
   ....: 

In [81]: df = pd.DataFrame(np.random.randn(4, 2), index=midx)

In [82]: df
Out[82]: 
               0         1
one  y  1.519970 -0.493662
     x  0.600178  0.274230
zero y  0.132885 -0.023688
     x  2.410179  1.450520

In [83]: df2 = df.groupby(level=0).mean()

In [84]: df2
Out[84]: 
             0         1
one   1.060074 -0.109716
zero  1.271532  0.713416

In [85]: df2.reindex(df.index, level=0)
Out[85]: 
               0         1
one  y  1.060074 -0.109716
     x  1.060074 -0.109716
zero y  1.271532  0.713416
     x  1.271532  0.713416

# aligning
In [86]: df_aligned, df2_aligned = df.align(df2, level=0)

In [87]: df_aligned
Out[87]: 
               0         1
one  y  1.519970 -0.493662
     x  0.600178  0.274230
zero y  0.132885 -0.023688
     x  2.410179  1.450520

In [88]: df2_aligned
Out[88]: 
               0         1
one  y  1.060074 -0.109716
     x  1.060074 -0.109716
zero y  1.271532  0.713416
     x  1.271532  0.713416

Swapping levels with swaplevel#

The swaplevel() method can switch the order of two levels:

In [89]: df[:5]
Out[89]: 
               0         1
one  y  1.519970 -0.493662
     x  0.600178  0.274230
zero y  0.132885 -0.023688
     x  2.410179  1.450520

In [90]: df[:5].swaplevel(0, 1, axis=0)
Out[90]: 
               0         1
y one   1.519970 -0.493662
x one   0.600178  0.274230
y zero  0.132885 -0.023688
x zero  2.410179  1.450520

Reordering levels with reorder_levels#

The reorder_levels() method generalizes the swaplevel method, allowing you to permute the hierarchical index levels in one step:

In [91]: df[:5].reorder_levels([1, 0], axis=0)
Out[91]: 
               0         1
y one   1.519970 -0.493662
x one   0.600178  0.274230
y zero  0.132885 -0.023688
x zero  2.410179  1.450520

Renaming names of an Index or MultiIndex#

The rename() method is used to rename the labels of a MultiIndex, and is typically used to rename the columns of a DataFrame. The columns argument of rename allows a dictionary to be specified that includes only the columns you wish to rename.

In [92]: df.rename(columns={0: "col0", 1: "col1"})
Out[92]: 
            col0      col1
one  y  1.519970 -0.493662
     x  0.600178  0.274230
zero y  0.132885 -0.023688
     x  2.410179  1.450520

This method can also be used to rename specific labels of the main index of the DataFrame.

In [93]: df.rename(index={"one": "two", "y": "z"})
Out[93]: 
               0         1
two  z  1.519970 -0.493662
     x  0.600178  0.274230
zero z  0.132885 -0.023688
     x  2.410179  1.450520

The rename_axis() method is used to rename the name of an Index or MultiIndex. In particular, the names of the levels of a MultiIndex can be specified, which is useful if reset_index() is later used to move the values from the MultiIndex to a column.

In [94]: df.rename_axis(index=["abc", "def"])
Out[94]: 
                 0         1
abc  def                    
one  y    1.519970 -0.493662
     x    0.600178  0.274230
zero y    0.132885 -0.023688
     x    2.410179  1.450520

Note that the columns of a DataFrame are an index, so that using rename_axis with the columns argument will change the name of that index.

In [95]: df.rename_axis(columns="Cols").columns
Out[95]: RangeIndex(start=0, stop=2, step=1, name='Cols')

Both rename and rename_axis support specifying a dictionary, Series or a mapping function to map labels/names to new values.
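As a small sketch of the mapping-function form (using str.upper as the mapper), rename can also be restricted to one level of a MultiIndex via its level argument:

```python
import numpy as np
import pandas as pd

midx = pd.MultiIndex.from_product([["one", "zero"], ["x", "y"]])
df = pd.DataFrame(np.arange(8).reshape(4, 2), index=midx)

# Apply the function only to the labels of the first level.
upper = df.rename(index=str.upper, level=0)
print(upper.index.get_level_values(0).unique().tolist())
```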

When working with an Index object directly, rather than via a DataFrame, Index.set_names() can be used to change the names.

In [96]: mi = pd.MultiIndex.from_product([[1, 2], ["a", "b"]], names=["x", "y"])

In [97]: mi.names
Out[97]: FrozenList(['x', 'y'])

In [98]: mi2 = mi.rename("new name", level=0)

In [99]: mi2
Out[99]: 
MultiIndex([(1, 'a'),
            (1, 'b'),
            (2, 'a'),
            (2, 'b')],
           names=['new name', 'y'])

You cannot set the names of the MultiIndex via a level.

In [100]: mi.levels[0].name = "name via level"
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[100], line 1
----> 1 mi.levels[0].name = "name via level"

File /home/pandas/pandas/core/indexes/base.py:1743, in Index.name(self, value)
   1739 @name.setter
   1740 def name(self, value: Hashable) -> None:
   1741     if self._no_setting_name:
   1742         # Used in MultiIndex.levels to avoid silently ignoring name updates.
-> 1743         raise RuntimeError(
   1744             "Cannot set name on a level of a MultiIndex. Use "
   1745             "'MultiIndex.set_names' instead."
   1746         )
   1747     maybe_extract_name(value, None, type(self))
   1748     self._name = value

RuntimeError: Cannot set name on a level of a MultiIndex. Use 'MultiIndex.set_names' instead.

Use Index.set_names() instead.
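A short sketch of that API: Index.set_names() returns a new index and leaves the original untouched.

```python
import pandas as pd

mi = pd.MultiIndex.from_product([[1, 2], ["a", "b"]], names=["x", "y"])

# set_names returns a new MultiIndex; the original is unchanged.
mi2 = mi.set_names("name via set_names", level=0)
print(mi2.names)
print(mi.names)
```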

MultiIndex 进行排序#

对于 MultiIndex 索引的对象要有效地进行索引和切片,它们需要被排序。与任何索引一样,你可以使用 sort_index()

In [101]: import random

In [102]: random.shuffle(tuples)

In [103]: s = pd.Series(np.random.randn(8), index=pd.MultiIndex.from_tuples(tuples))

In [104]: s
Out[104]: 
bar  two    0.206053
qux  two   -0.251905
baz  two   -2.213588
foo  one    1.063327
qux  one    1.266143
foo  two    0.299368
baz  one   -0.863838
bar  one    0.408204
dtype: float64

In [105]: s.sort_index()
Out[105]: 
bar  one    0.408204
     two    0.206053
baz  one   -0.863838
     two   -2.213588
foo  one    1.063327
     two    0.299368
qux  one    1.266143
     two   -0.251905
dtype: float64

In [106]: s.sort_index(level=0)
Out[106]: 
bar  one    0.408204
     two    0.206053
baz  one   -0.863838
     two   -2.213588
foo  one    1.063327
     two    0.299368
qux  one    1.266143
     two   -0.251905
dtype: float64

In [107]: s.sort_index(level=1)
Out[107]: 
bar  one    0.408204
baz  one   -0.863838
foo  one    1.063327
qux  one    1.266143
bar  two    0.206053
baz  two   -2.213588
foo  two    0.299368
qux  two   -0.251905
dtype: float64

You may also pass a level name to sort_index if the MultiIndex levels are named.

In [108]: s.index = s.index.set_names(["L1", "L2"])

In [109]: s.sort_index(level="L1")
Out[109]: 
L1   L2 
bar  one    0.408204
     two    0.206053
baz  one   -0.863838
     two   -2.213588
foo  one    1.063327
     two    0.299368
qux  one    1.266143
     two   -0.251905
dtype: float64

In [110]: s.sort_index(level="L2")
Out[110]: 
L1   L2 
bar  one    0.408204
baz  one   -0.863838
foo  one    1.063327
qux  one    1.266143
bar  two    0.206053
baz  two   -2.213588
foo  two    0.299368
qux  two   -0.251905
dtype: float64

On higher dimensional objects, you can sort any of the other axes by level if they have a MultiIndex:

In [111]: df.T.sort_index(level=1, axis=1)
Out[111]: 
        one      zero       one      zero
          x         x         y         y
0  0.600178  2.410179  1.519970  0.132885
1  0.274230  1.450520 -0.493662 -0.023688

Indexing will work even if the data are not sorted, but will be rather inefficient (and show a PerformanceWarning). It will also return a copy of the data rather than a view:

In [112]: dfm = pd.DataFrame(
   .....:     {"jim": [0, 0, 1, 1], "joe": ["x", "x", "z", "y"], "jolie": np.random.rand(4)}
   .....: )
   .....: 

In [113]: dfm = dfm.set_index(["jim", "joe"])

In [114]: dfm
Out[114]: 
            jolie
jim joe          
0   x    0.490671
    x    0.120248
1   z    0.537020
    y    0.110968

In [115]: dfm.loc[(1, 'z')]
Out[115]: 
           jolie
jim joe         
1   z    0.53702

Furthermore, if you try to index something that is not fully lexsorted, this can raise:

In [116]: dfm.loc[(0, 'y'):(1, 'z')]
---------------------------------------------------------------------------
UnsortedIndexError                        Traceback (most recent call last)
Cell In[116], line 1
----> 1 dfm.loc[(0, 'y'):(1, 'z')]

File /home/pandas/pandas/core/indexing.py:1195, in _LocationIndexer.__getitem__(self, key)
   1193 maybe_callable = com.apply_if_callable(key, self.obj)
   1194 maybe_callable = self._raise_callable_usage(key, maybe_callable)
-> 1195 return self._getitem_axis(maybe_callable, axis=axis)

File /home/pandas/pandas/core/indexing.py:1415, in _LocIndexer._getitem_axis(self, key, axis)
   1413 if isinstance(key, slice):
   1414     self._validate_key(key, axis)
-> 1415     return self._get_slice_axis(key, axis=axis)
   1416 elif com.is_bool_indexer(key):
   1417     return self._getbool_axis(key, axis=axis)

File /home/pandas/pandas/core/indexing.py:1447, in _LocIndexer._get_slice_axis(self, slice_obj, axis)
   1444     return obj.copy(deep=False)
   1446 labels = obj._get_axis(axis)
-> 1447 indexer = labels.slice_indexer(slice_obj.start, slice_obj.stop, slice_obj.step)
   1449 if isinstance(indexer, slice):
   1450     return self.obj._slice(indexer, axis=axis)

File /home/pandas/pandas/core/indexes/base.py:6524, in Index.slice_indexer(self, start, end, step)
   6473 def slice_indexer(
   6474     self,
   6475     start: Hashable | None = None,
   6476     end: Hashable | None = None,
   6477     step: int | None = None,
   6478 ) -> slice:
   6479     """
   6480     Compute the slice indexer for input labels and step.
   6481 
   (...)
   6522     slice(1, 3, None)
   6523     """
-> 6524     start_slice, end_slice = self.slice_locs(start, end, step=step)
   6526     # return a slice
   6527     if not is_scalar(start_slice):

File /home/pandas/pandas/core/indexes/multi.py:2998, in MultiIndex.slice_locs(self, start, end, step)
   2945 """
   2946 For an ordered MultiIndex, compute the slice locations for input
   2947 labels.
   (...)
   2994                       sequence of such.
   2995 """
   2996 # This function adds nothing to its parent implementation (the magic
   2997 # happens in get_slice_bound method), but it adds meaningful doc.
-> 2998 return super().slice_locs(start, end, step)

File /home/pandas/pandas/core/indexes/base.py:6755, in Index.slice_locs(self, start, end, step)
   6753 start_slice = None
   6754 if start is not None:
-> 6755     start_slice = self.get_slice_bound(start, "left")
   6756 if start_slice is None:
   6757     start_slice = 0

File /home/pandas/pandas/core/indexes/multi.py:2942, in MultiIndex.get_slice_bound(self, label, side)
   2940 if not isinstance(label, tuple):
   2941     label = (label,)
-> 2942 return self._partial_tup_index(label, side=side)

File /home/pandas/pandas/core/indexes/multi.py:3002, in MultiIndex._partial_tup_index(self, tup, side)
   3000 def _partial_tup_index(self, tup: tuple, side: Literal["left", "right"] = "left"):
   3001     if len(tup) > self._lexsort_depth:
-> 3002         raise UnsortedIndexError(
   3003             f"Key length ({len(tup)}) was greater than MultiIndex lexsort depth "
   3004             f"({self._lexsort_depth})"
   3005         )
   3007     n = len(tup)
   3008     start, end = 0, len(self)

UnsortedIndexError: 'Key length (2) was greater than MultiIndex lexsort depth (1)'

The is_monotonic_increasing() method on a MultiIndex shows if the index is sorted:

In [117]: dfm.index.is_monotonic_increasing
Out[117]: False

In [118]: dfm = dfm.sort_index()

In [119]: dfm
Out[119]: 
            jolie
jim joe          
0   x    0.490671
    x    0.120248
1   y    0.110968
    z    0.537020

In [120]: dfm.index.is_monotonic_increasing
Out[120]: True

And now selection works as expected.

In [121]: dfm.loc[(0, "y"):(1, "z")]
Out[121]: 
            jolie
jim joe          
1   y    0.110968
    z    0.537020

Take methods#

Similar to NumPy ndarrays, pandas Index, Series, and DataFrame also provide the take() method that retrieves elements along a given axis at the given indices. The given indices must be either a list or an ndarray of integer index positions. take will also accept negative integers as relative positions to the end of the object.

In [122]: index = pd.Index(np.random.randint(0, 1000, 10))

In [123]: index
Out[123]: Index([214, 502, 712, 567, 786, 175, 993, 133, 758, 329], dtype='int64')

In [124]: positions = [0, 9, 3]

In [125]: index[positions]
Out[125]: Index([214, 329, 567], dtype='int64')

In [126]: index.take(positions)
Out[126]: Index([214, 329, 567], dtype='int64')

In [127]: ser = pd.Series(np.random.randn(10))

In [128]: ser.iloc[positions]
Out[128]: 
0   -0.179666
9    1.824375
3    0.392149
dtype: float64

In [129]: ser.take(positions)
Out[129]: 
0   -0.179666
9    1.824375
3    0.392149
dtype: float64

For DataFrames, the given indices should be a 1d list or ndarray that specifies row or column positions.

In [130]: frm = pd.DataFrame(np.random.randn(5, 3))

In [131]: frm.take([1, 4, 3])
Out[131]: 
          0         1         2
1 -1.237881  0.106854 -1.276829
4  0.629675 -1.425966  1.857704
3  0.979542 -1.633678  0.615855

In [132]: frm.take([0, 2], axis=1)
Out[132]: 
          0         2
0  0.595974  0.601544
1 -1.237881 -1.276829
2 -0.767101  1.499591
3  0.979542  0.615855
4  0.629675  1.857704

It is important to note that the take method on pandas objects is not intended to work on boolean indices and may return unexpected results.

In [133]: arr = np.random.randn(10)

In [134]: arr.take([False, False, True, True])
Out[134]: array([-1.1935, -1.1935,  0.6775,  0.6775])

In [135]: arr[[0, 1]]
Out[135]: array([-1.1935,  0.6775])

In [136]: ser = pd.Series(np.random.randn(10))

In [137]: ser.take([False, False, True, True])
Out[137]: 
0    0.233141
0    0.233141
1   -0.223540
1   -0.223540
dtype: float64

In [138]: ser.iloc[[0, 1]]
Out[138]: 
0    0.233141
1   -0.223540
dtype: float64
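For boolean selection, the idiomatic route is `[]` or `.loc` with a boolean mask; if take() is genuinely needed, convert the mask to integer positions first. A minimal sketch:

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.random.randn(10))
mask = ser > 0

# Boolean masks go through [] or .loc, not take():
positive = ser[mask]  # equivalent to ser.loc[mask]

# If take() is required, convert the mask to integer positions first:
positions = np.flatnonzero(mask)
via_take = ser.take(positions)
```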

Finally, a small note on performance: because the take method handles a narrower range of inputs, it can offer performance that is a good deal faster than fancy indexing.

In [139]: arr = np.random.randn(10000, 5)

In [140]: indexer = np.arange(10000)

In [141]: random.shuffle(indexer)

In [142]: %timeit arr[indexer]
   .....: %timeit arr.take(indexer, axis=0)
   .....: 
113 us +- 10.2 us per loop (mean +- std. dev. of 7 runs, 10,000 loops each)
29.8 us +- 2.38 us per loop (mean +- std. dev. of 7 runs, 10,000 loops each)

In [143]: ser = pd.Series(arr[:, 0])

In [144]: %timeit ser.iloc[indexer]
   .....: %timeit ser.take(indexer)
   .....: 
93.3 us +- 14.3 us per loop (mean +- std. dev. of 7 runs, 10,000 loops each)
89.9 us +- 14.1 us per loop (mean +- std. dev. of 7 runs, 10,000 loops each)

Index types#

We have discussed MultiIndex in the previous sections pretty extensively. Documentation about DatetimeIndex and PeriodIndex is shown here, and documentation about TimedeltaIndex is found here.

In the following sub-sections we will highlight some other index types.

CategoricalIndex#

CategoricalIndex is a type of index that is useful for supporting indexing with duplicates. This is a container around a Categorical and allows efficient indexing and storage of an index with a large number of duplicated elements.

In [145]: from pandas.api.types import CategoricalDtype

In [146]: df = pd.DataFrame({"A": np.arange(6), "B": list("aabbca")})

In [147]: df["B"] = df["B"].astype(CategoricalDtype(list("cab")))

In [148]: df
Out[148]: 
   A  B
0  0  a
1  1  a
2  2  b
3  3  b
4  4  c
5  5  a

In [149]: df.dtypes
Out[149]: 
A       int64
B    category
dtype: object

In [150]: df["B"].cat.categories
Out[150]: Index(['c', 'a', 'b'], dtype='object')

Setting the index will create a CategoricalIndex:

In [151]: df2 = df.set_index("B")

In [152]: df2.index
Out[152]: CategoricalIndex(['a', 'a', 'b', 'b', 'c', 'a'], categories=['c', 'a', 'b'], ordered=False, dtype='category', name='B')

Indexing with __getitem__/.iloc/.loc works similarly to an Index with duplicates. The indexers must be in the categories, or the operation will raise a KeyError.

In [153]: df2.loc["a"]
Out[153]: 
   A
B   
a  0
a  1
a  5

The CategoricalIndex is preserved after indexing:

In [154]: df2.loc["a"].index
Out[154]: CategoricalIndex(['a', 'a', 'a'], categories=['c', 'a', 'b'], ordered=False, dtype='category', name='B')
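Since the full set of categories is preserved even when some no longer appear, unused categories can be dropped afterwards with remove_unused_categories() if desired. A minimal sketch:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({"A": range(6), "B": list("aabbca")})
df["B"] = df["B"].astype(CategoricalDtype(list("cab")))
df2 = df.set_index("B")

sub = df2.loc["a"]                              # index still carries categories ['c', 'a', 'b']
trimmed = sub.index.remove_unused_categories()  # only 'a' remains as a category
```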

Sorting the index will sort by the order of the categories (recall that we created the index with CategoricalDtype(list('cab')), so the sorted order is cab).

In [155]: df2.sort_index()
Out[155]: 
   A
B   
c  4
a  0
a  1
a  5
b  2
b  3

Groupby operations on the index will preserve the index nature as well.

In [156]: df2.groupby(level=0, observed=True).sum()
Out[156]: 
   A
B   
c  4
a  6
b  5

In [157]: df2.groupby(level=0, observed=True).sum().index
Out[157]: CategoricalIndex(['c', 'a', 'b'], categories=['c', 'a', 'b'], ordered=False, dtype='category', name='B')
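The examples above pass observed=True, which limits the result to categories actually present in the data. With observed=False, groupby emits one group per category, including unobserved ones. A sketch, assuming an extra category 'd' that never occurs in the data:

```python
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({"A": np.arange(6), "B": list("aabbca")})
df["B"] = df["B"].astype(CategoricalDtype(list("cabd")))  # 'd' is unobserved
df2 = df.set_index("B")

observed = df2.groupby(level=0, observed=True).sum()   # rows for c, a, b only
full = df2.groupby(level=0, observed=False).sum()      # also a row for 'd' (sum is 0)
```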

Reindexing operations will return a resulting index based on the type of the passed indexer. Passing a list will return a plain-old Index; indexing with a Categorical will return a CategoricalIndex, indexed according to the categories of the passed Categorical dtype. This allows one to arbitrarily index these even with values not in the categories, similarly to how you can reindex any pandas index.

In [158]: df3 = pd.DataFrame(
   .....:     {"A": np.arange(3), "B": pd.Series(list("abc")).astype("category")}
   .....: )
   .....: 

In [159]: df3 = df3.set_index("B")

In [160]: df3
Out[160]: 
   A
B   
a  0
b  1
c  2

In [161]: df3.reindex(["a", "e"])
Out[161]: 
     A
B     
a  0.0
e  NaN

In [162]: df3.reindex(["a", "e"]).index
Out[162]: Index(['a', 'e'], dtype='object', name='B')

In [163]: df3.reindex(pd.Categorical(["a", "e"], categories=list("abe")))
Out[163]: 
     A
B     
a  0.0
e  NaN

In [164]: df3.reindex(pd.Categorical(["a", "e"], categories=list("abe"))).index
Out[164]: CategoricalIndex(['a', 'e'], categories=['a', 'b', 'e'], ordered=False, dtype='category', name='B')

Warning

CategoricalIndex 进行重塑和比较操作时,必须具有相同的类别,否则会引发 TypeError

In [165]: df4 = pd.DataFrame({"A": np.arange(2), "B": list("ba")})

In [166]: df4["B"] = df4["B"].astype(CategoricalDtype(list("ab")))

In [167]: df4 = df4.set_index("B")

In [168]: df4.index
Out[168]: CategoricalIndex(['b', 'a'], categories=['a', 'b'], ordered=False, dtype='category', name='B')

In [169]: df5 = pd.DataFrame({"A": np.arange(2), "B": list("bc")})

In [170]: df5["B"] = df5["B"].astype(CategoricalDtype(list("bc")))

In [171]: df5 = df5.set_index("B")

In [172]: df5.index
Out[172]: CategoricalIndex(['b', 'c'], categories=['b', 'c'], ordered=False, dtype='category', name='B')

In [173]: pd.concat([df4, df5])
Out[173]: 
   A
B   
b  0
a  1
b  0
c  1

RangeIndex#

RangeIndexIndex 的一个子类,为所有 DataFrameSeries 对象提供默认索引。RangeIndexIndex 的一个优化版本,可以表示单调有序集合。这些类似于 Python range 类型。一个 RangeIndex 将始终具有 int64 dtype。

In [174]: idx = pd.RangeIndex(5)

In [175]: idx
Out[175]: RangeIndex(start=0, stop=5, step=1)

RangeIndex is the default index for all DataFrame and Series objects:

In [176]: ser = pd.Series([1, 2, 3])

In [177]: ser.index
Out[177]: RangeIndex(start=0, stop=3, step=1)

In [178]: df = pd.DataFrame([[1, 2], [3, 4]])

In [179]: df.index
Out[179]: RangeIndex(start=0, stop=2, step=1)

In [180]: df.columns
Out[180]: RangeIndex(start=0, stop=2, step=1)

A RangeIndex will behave similarly to an Index with an int64 dtype, and operations on a RangeIndex whose result cannot be represented by a RangeIndex, but should have an integer dtype, will be converted to an Index with int64 dtype. For example:

In [181]: idx[[0, 2]]
Out[181]: RangeIndex(start=0, stop=4, step=2)
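One practical consequence of the optimization is memory: a RangeIndex stores only its start, stop, and step, while a materialized int64 Index stores every value. A rough comparison (exact byte counts vary by pandas version):

```python
import numpy as np
import pandas as pd

lazy = pd.RangeIndex(1_000_000)
materialized = pd.Index(np.arange(1_000_000))  # plain int64 Index

range_bytes = lazy.memory_usage()
index_bytes = materialized.memory_usage()
# The RangeIndex footprint is constant; the Index holds a million int64 values.
```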

IntervalIndex#

IntervalIndex, together with its own dtype IntervalDtype and the Interval scalar type, allows first-class support in pandas for interval notation.

The IntervalIndex allows some unique indexing and is also used as a return type for the categories in cut() and qcut().

Indexing with an IntervalIndex#

An IntervalIndex can be used in Series and in DataFrame as the index.

In [182]: df = pd.DataFrame(
   .....:     {"A": [1, 2, 3, 4]}, index=pd.IntervalIndex.from_breaks([0, 1, 2, 3, 4])
   .....: )
   .....: 

In [183]: df
Out[183]: 
        A
(0, 1]  1
(1, 2]  2
(2, 3]  3
(3, 4]  4

Label-based indexing via .loc along the edges of an interval works as you would expect, selecting that particular interval.

In [184]: df.loc[2]
Out[184]: 
A    2
Name: (1, 2], dtype: int64

In [185]: df.loc[[2, 3]]
Out[185]: 
        A
(1, 2]  2
(2, 3]  3

If you select a label contained within an interval, this will also select the interval.

In [186]: df.loc[2.5]
Out[186]: 
A    3
Name: (2, 3], dtype: int64

In [187]: df.loc[[2.5, 3.5]]
Out[187]: 
        A
(2, 3]  3
(3, 4]  4

Selecting using an Interval will only return exact matches.

In [188]: df.loc[pd.Interval(1, 2)]
Out[188]: 
A    2
Name: (1, 2], dtype: int64

Trying to select an Interval that is not exactly contained in the IntervalIndex will raise a KeyError.

In [189]: df.loc[pd.Interval(0.5, 2.5)]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[189], line 1
----> 1 df.loc[pd.Interval(0.5, 2.5)]

File /home/pandas/pandas/core/indexing.py:1195, in _LocationIndexer.__getitem__(self, key)
   1193 maybe_callable = com.apply_if_callable(key, self.obj)
   1194 maybe_callable = self._raise_callable_usage(key, maybe_callable)
-> 1195 return self._getitem_axis(maybe_callable, axis=axis)

File /home/pandas/pandas/core/indexing.py:1435, in _LocIndexer._getitem_axis(self, key, axis)
   1433 # fall thru to straight lookup
   1434 self._validate_key(key, axis)
-> 1435 return self._get_label(key, axis=axis)

File /home/pandas/pandas/core/indexing.py:1385, in _LocIndexer._get_label(self, label, axis)
   1383 def _get_label(self, label, axis: AxisInt):
   1384     # GH#5567 this will fail if the label is not present in the axis.
-> 1385     return self.obj.xs(label, axis=axis)

File /home/pandas/pandas/core/generic.py:4138, in NDFrame.xs(self, key, axis, level, drop_level)
   4136             new_index = index[loc]
   4137 else:
-> 4138     loc = index.get_loc(key)
   4140     if isinstance(loc, np.ndarray):
   4141         if loc.dtype == np.bool_:

File /home/pandas/pandas/core/indexes/interval.py:691, in IntervalIndex.get_loc(self, key)
    689 matches = mask.sum()
    690 if matches == 0:
--> 691     raise KeyError(key)
    692 if matches == 1:
    693     return mask.argmax()

KeyError: Interval(0.5, 2.5, closed='right')

Selecting all Intervals that overlap a given Interval can be performed using the overlaps() method to create a boolean indexer.

In [190]: idxr = df.index.overlaps(pd.Interval(0.5, 2.5))

In [191]: idxr
Out[191]: array([ True,  True,  True, False])

In [192]: df[idxr]
Out[192]: 
        A
(0, 1]  1
(1, 2]  2
(2, 3]  3

使用 cutqcut 进行数据分箱#

cut()qcut() 都返回一个 Categorical 对象,它们创建的箱子作为 IntervalIndex 存储在其 .categories 属性中。

In [193]: c = pd.cut(range(4), bins=2)

In [194]: c
Out[194]: 
[(-0.003, 1.5], (-0.003, 1.5], (1.5, 3.0], (1.5, 3.0]]
Categories (2, interval[float64, right]): [(-0.003, 1.5] < (1.5, 3.0]]

In [195]: c.categories
Out[195]: IntervalIndex([(-0.003, 1.5], (1.5, 3.0]], dtype='interval[float64, right]')

cut() also accepts an IntervalIndex for its bins argument, which enables a useful pandas idiom. First, we call cut() with some data and bins set to a fixed number to generate the bins. Then, we pass the values of .categories as the bins argument in subsequent calls to cut(), supplying new data which will be binned into the same bins.

In [196]: pd.cut([0, 3, 5, 1], bins=c.categories)
Out[196]: 
[(-0.003, 1.5], (1.5, 3.0], NaN, (-0.003, 1.5]]
Categories (2, interval[float64, right]): [(-0.003, 1.5] < (1.5, 3.0]]

Any value which falls outside of all bins will be assigned a NaN value.

Generating ranges of intervals#

If we need intervals on a regular frequency, we can use the interval_range() function to create an IntervalIndex using various combinations of start, end, and periods. The default frequency for interval_range is 1 for numeric intervals and calendar day for datetime-like intervals:

In [197]: pd.interval_range(start=0, end=5)
Out[197]: IntervalIndex([(0, 1], (1, 2], (2, 3], (3, 4], (4, 5]], dtype='interval[int64, right]')

In [198]: pd.interval_range(start=pd.Timestamp("2017-01-01"), periods=4)
Out[198]: 
IntervalIndex([(2017-01-01 00:00:00, 2017-01-02 00:00:00],
               (2017-01-02 00:00:00, 2017-01-03 00:00:00],
               (2017-01-03 00:00:00, 2017-01-04 00:00:00],
               (2017-01-04 00:00:00, 2017-01-05 00:00:00]],
              dtype='interval[datetime64[ns], right]')

In [199]: pd.interval_range(end=pd.Timedelta("3 days"), periods=3)
Out[199]: 
IntervalIndex([(0 days 00:00:00, 1 days 00:00:00],
               (1 days 00:00:00, 2 days 00:00:00],
               (2 days 00:00:00, 3 days 00:00:00]],
              dtype='interval[timedelta64[ns], right]')

The freq parameter can be used to specify non-default frequencies and can utilize a variety of frequency aliases with datetime-like intervals:

In [200]: pd.interval_range(start=0, periods=5, freq=1.5)
Out[200]: IntervalIndex([(0.0, 1.5], (1.5, 3.0], (3.0, 4.5], (4.5, 6.0], (6.0, 7.5]], dtype='interval[float64, right]')

In [201]: pd.interval_range(start=pd.Timestamp("2017-01-01"), periods=4, freq="W")
Out[201]: 
IntervalIndex([(2017-01-01 00:00:00, 2017-01-08 00:00:00],
               (2017-01-08 00:00:00, 2017-01-15 00:00:00],
               (2017-01-15 00:00:00, 2017-01-22 00:00:00],
               (2017-01-22 00:00:00, 2017-01-29 00:00:00]],
              dtype='interval[datetime64[ns], right]')

In [202]: pd.interval_range(start=pd.Timedelta("0 days"), periods=3, freq="9h")
Out[202]: 
IntervalIndex([(0 days 00:00:00, 0 days 09:00:00],
               (0 days 09:00:00, 0 days 18:00:00],
               (0 days 18:00:00, 1 days 03:00:00]],
              dtype='interval[timedelta64[ns], right]')

Additionally, the closed parameter can be used to specify on which side(s) the intervals are closed. Intervals are closed on the right side by default.

In [203]: pd.interval_range(start=0, end=4, closed="both")
Out[203]: IntervalIndex([[0, 1], [1, 2], [2, 3], [3, 4]], dtype='interval[int64, both]')

In [204]: pd.interval_range(start=0, end=4, closed="neither")
Out[204]: IntervalIndex([(0, 1), (1, 2), (2, 3), (3, 4)], dtype='interval[int64, neither]')

Specifying start, end, and periods will generate a range of evenly spaced intervals from start to end inclusively, with periods number of elements in the resulting IntervalIndex:

In [205]: pd.interval_range(start=0, end=6, periods=4)
Out[205]: IntervalIndex([(0.0, 1.5], (1.5, 3.0], (3.0, 4.5], (4.5, 6.0]], dtype='interval[float64, right]')

In [206]: pd.interval_range(pd.Timestamp("2018-01-01"), pd.Timestamp("2018-02-28"), periods=3)
Out[206]: 
IntervalIndex([(2018-01-01 00:00:00, 2018-01-20 08:00:00],
               (2018-01-20 08:00:00, 2018-02-08 16:00:00],
               (2018-02-08 16:00:00, 2018-02-28 00:00:00]],
              dtype='interval[datetime64[ns], right]')

Miscellaneous indexing FAQ#

Integer indexing#

Label-based indexing with integer axis labels is a thorny topic. It has been discussed heavily on mailing lists and among various members of the scientific Python community. In pandas, our general viewpoint is that labels matter more than integer locations. Therefore, with an integer axis index, only label-based indexing is possible with standard tools like .loc. The following code will generate exceptions:

In [207]: s = pd.Series(range(5))

In [208]: s[-1]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File /home/pandas/pandas/core/indexes/range.py:424, in RangeIndex.get_loc(self, key)
    423 try:
--> 424     return self._range.index(new_key)
    425 except ValueError as err:

ValueError: -1 is not in range

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[208], line 1
----> 1 s[-1]

File /home/pandas/pandas/core/series.py:903, in Series.__getitem__(self, key)
    898     key = unpack_1tuple(key)
    900 elif key_is_scalar:
    901     # Note: GH#50617 in 3.0 we changed int key to always be treated as
    902     #  a label, matching DataFrame behavior.
--> 903     return self._get_value(key)
    905 # Convert generator to list before going through hashable part
    906 # (We will iterate through the generator there to check for slices)
    907 if is_iterator(key):

File /home/pandas/pandas/core/series.py:990, in Series._get_value(self, label, takeable)
    987     return self._values[label]
    989 # Similar to Index.get_value, but we do not fall back to positional
--> 990 loc = self.index.get_loc(label)
    992 if is_integer(loc):
    993     return self._values[loc]

File /home/pandas/pandas/core/indexes/range.py:426, in RangeIndex.get_loc(self, key)
    424         return self._range.index(new_key)
    425     except ValueError as err:
--> 426         raise KeyError(key) from err
    427 if isinstance(key, Hashable):
    428     raise KeyError(key)

KeyError: -1
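To retrieve elements by position, including negative positions counted from the end, use .iloc instead. A minimal sketch:

```python
import pandas as pd

s = pd.Series(range(5))

last = s.iloc[-1]   # positional: the final element
first = s.loc[0]    # label-based: the row labeled 0
```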

In [209]: df = pd.DataFrame(np.random.randn(5, 4))

In [210]: df
Out[210]: 
          0         1         2         3
0 -0.435772 -1.188928 -0.808286 -0.284634
1 -1.815703  1.347213 -0.243487  0.514704
2  1.162969 -0.287725 -0.179734  0.993962
3 -0.212673  0.909872 -0.733333 -0.349893
4  0.456434 -0.306735  0.553396  0.166221

In [211]: df.loc[-2:]
Out[211]: 
          0         1         2         3
0 -0.435772 -1.188928 -0.808286 -0.284634
1 -1.815703  1.347213 -0.243487  0.514704
2  1.162969 -0.287725 -0.179734  0.993962
3 -0.212673  0.909872 -0.733333 -0.349893
4  0.456434 -0.306735  0.553396  0.166221

This deliberate decision was made to prevent ambiguities and subtle bugs (many users reported finding bugs when the API change was made to stop "falling back" on position-based indexing).

Non-monotonic indexes require exact matches#

如果 SeriesDataFrame 的索引是单调递增或递减的,那么基于标签的切片边界可以超出索引的范围,就像对一个普通的 Python list 进行切片索引一样。索引的单调性可以通过 is_monotonic_increasing()is_monotonic_decreasing() 属性进行测试。

In [212]: df = pd.DataFrame(index=[2, 3, 3, 4, 5], columns=["data"], data=list(range(5)))

In [213]: df.index.is_monotonic_increasing
Out[213]: True

# no rows 0 or 1, but still returns rows 2, 3 (both of them), and 4:
In [214]: df.loc[0:4, :]
Out[214]: 
   data
2     0
3     1
3     2
4     3

# slice bounds are outside the index, so an empty DataFrame is returned
In [215]: df.loc[13:15, :]
Out[215]: 
Empty DataFrame
Columns: [data]
Index: []

On the other hand, if the index is not monotonic, then both slice bounds must be unique members of the index.

In [216]: df = pd.DataFrame(index=[2, 3, 1, 4, 3, 5], columns=["data"], data=list(range(6)))

In [217]: df.index.is_monotonic_increasing
Out[217]: False

# OK because 2 and 4 are in the index
In [218]: df.loc[2:4, :]
Out[218]: 
   data
2     0
3     1
1     2
4     3

# 0 is not in the index
In [219]: df.loc[0:4, :]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[219], line 1
----> 1 df.loc[0:4, :]

File /home/pandas/pandas/core/indexing.py:1188, in _LocationIndexer.__getitem__(self, key)
   1186     if self._is_scalar_access(key):
   1187         return self.obj._get_value(*key, takeable=self._takeable)
-> 1188     return self._getitem_tuple(key)
   1189 else:
   1190     # we by definition only have the 0th axis
   1191     axis = self.axis or 0

File /home/pandas/pandas/core/indexing.py:1381, in _LocIndexer._getitem_tuple(self, tup)
   1378 if self._multi_take_opportunity(tup):
   1379     return self._multi_take(tup)
-> 1381 return self._getitem_tuple_same_dim(tup)

File /home/pandas/pandas/core/indexing.py:1027, in _LocationIndexer._getitem_tuple_same_dim(self, tup)
   1024 if com.is_null_slice(key):
   1025     continue
-> 1027 retval = getattr(retval, self.name)._getitem_axis(key, axis=i)
   1028 # We should never have retval.ndim < self.ndim, as that should
   1029 #  be handled by the _getitem_lowerdim call above.
   1030 assert retval.ndim == self.ndim

File /home/pandas/pandas/core/indexing.py:1415, in _LocIndexer._getitem_axis(self, key, axis)
   1413 if isinstance(key, slice):
   1414     self._validate_key(key, axis)
-> 1415     return self._get_slice_axis(key, axis=axis)
   1416 elif com.is_bool_indexer(key):
   1417     return self._getbool_axis(key, axis=axis)

File /home/pandas/pandas/core/indexing.py:1447, in _LocIndexer._get_slice_axis(self, slice_obj, axis)
   1444     return obj.copy(deep=False)
   1446 labels = obj._get_axis(axis)
-> 1447 indexer = labels.slice_indexer(slice_obj.start, slice_obj.stop, slice_obj.step)
   1449 if isinstance(indexer, slice):
   1450     return self.obj._slice(indexer, axis=axis)

File /home/pandas/pandas/core/indexes/base.py:6524, in Index.slice_indexer(self, start, end, step)
   6473 def slice_indexer(
   6474     self,
   6475     start: Hashable | None = None,
   6476     end: Hashable | None = None,
   6477     step: int | None = None,
   6478 ) -> slice:
   6479     """
   6480     Compute the slice indexer for input labels and step.
   6481 
   (...)
   6522     slice(1, 3, None)
   6523     """
-> 6524     start_slice, end_slice = self.slice_locs(start, end, step=step)
   6526     # return a slice
   6527     if not is_scalar(start_slice):

File /home/pandas/pandas/core/indexes/base.py:6755, in Index.slice_locs(self, start, end, step)
   6753 start_slice = None
   6754 if start is not None:
-> 6755     start_slice = self.get_slice_bound(start, "left")
   6756 if start_slice is None:
   6757     start_slice = 0

File /home/pandas/pandas/core/indexes/base.py:6669, in Index.get_slice_bound(self, label, side)
   6666         return self._searchsorted_monotonic(label, side)
   6667     except ValueError:
   6668         # raise the original KeyError
-> 6669         raise err from None
   6671 if isinstance(slc, np.ndarray):
   6672     # get_loc may return a boolean array, which
   6673     # is OK as long as they are representable by a slice.
   6674     assert is_bool_dtype(slc.dtype)

File /home/pandas/pandas/core/indexes/base.py:6663, in Index.get_slice_bound(self, label, side)
   6661 # we need to look up the label
   6662 try:
-> 6663     slc = self.get_loc(label)
   6664 except KeyError as err:
   6665     try:

File /home/pandas/pandas/core/indexes/base.py:3585, in Index.get_loc(self, key)
   3580     if isinstance(casted_key, slice) or (
   3581         isinstance(casted_key, abc.Iterable)
   3582         and any(isinstance(x, slice) for x in casted_key)
   3583     ):
   3584         raise InvalidIndexError(key) from err
-> 3585     raise KeyError(key) from err
   3586 except TypeError:
   3587     # If we have a listlike key, _check_indexing_error will raise
   3588     #  InvalidIndexError. Otherwise we fall through and re-raise
   3589     #  the TypeError.
   3590     self._check_indexing_error(key)

KeyError: 0

# 3 is not a unique label
In [220]: df.loc[2:3, :]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[220], line 1
----> 1 df.loc[2:3, :]

File /home/pandas/pandas/core/indexing.py:1188, in _LocationIndexer.__getitem__(self, key)
   1186     if self._is_scalar_access(key):
   1187         return self.obj._get_value(*key, takeable=self._takeable)
-> 1188     return self._getitem_tuple(key)
   1189 else:
   1190     # we by definition only have the 0th axis
   1191     axis = self.axis or 0

File /home/pandas/pandas/core/indexing.py:1381, in _LocIndexer._getitem_tuple(self, tup)
   1378 if self._multi_take_opportunity(tup):
   1379     return self._multi_take(tup)
-> 1381 return self._getitem_tuple_same_dim(tup)

File /home/pandas/pandas/core/indexing.py:1027, in _LocationIndexer._getitem_tuple_same_dim(self, tup)
   1024 if com.is_null_slice(key):
   1025     continue
-> 1027 retval = getattr(retval, self.name)._getitem_axis(key, axis=i)
   1028 # We should never have retval.ndim < self.ndim, as that should
   1029 #  be handled by the _getitem_lowerdim call above.
   1030 assert retval.ndim == self.ndim

File /home/pandas/pandas/core/indexing.py:1415, in _LocIndexer._getitem_axis(self, key, axis)
   1413 if isinstance(key, slice):
   1414     self._validate_key(key, axis)
-> 1415     return self._get_slice_axis(key, axis=axis)
   1416 elif com.is_bool_indexer(key):
   1417     return self._getbool_axis(key, axis=axis)

File /home/pandas/pandas/core/indexing.py:1447, in _LocIndexer._get_slice_axis(self, slice_obj, axis)
   1444     return obj.copy(deep=False)
   1446 labels = obj._get_axis(axis)
-> 1447 indexer = labels.slice_indexer(slice_obj.start, slice_obj.stop, slice_obj.step)
   1449 if isinstance(indexer, slice):
   1450     return self.obj._slice(indexer, axis=axis)

File /home/pandas/pandas/core/indexes/base.py:6524, in Index.slice_indexer(self, start, end, step)
   6473 def slice_indexer(
   6474     self,
   6475     start: Hashable | None = None,
   6476     end: Hashable | None = None,
   6477     step: int | None = None,
   6478 ) -> slice:
   6479     """
   6480     Compute the slice indexer for input labels and step.
   6481 
   (...)
   6522     slice(1, 3, None)
   6523     """
-> 6524     start_slice, end_slice = self.slice_locs(start, end, step=step)
   6526     # return a slice
   6527     if not is_scalar(start_slice):

File /home/pandas/pandas/core/indexes/base.py:6761, in Index.slice_locs(self, start, end, step)
   6759 end_slice = None
   6760 if end is not None:
-> 6761     end_slice = self.get_slice_bound(end, "right")
   6762 if end_slice is None:
   6763     end_slice = len(self)

File /home/pandas/pandas/core/indexes/base.py:6677, in Index.get_slice_bound(self, label, side)
   6675     slc = lib.maybe_booleans_to_slice(slc.view("u1"))
   6676     if isinstance(slc, np.ndarray):
-> 6677         raise KeyError(
   6678             f"Cannot get {side} slice bound for non-unique "
   6679             f"label: {original_label!r}"
   6680         )
   6682 if isinstance(slc, slice):
   6683     if side == "left":

KeyError: 'Cannot get right slice bound for non-unique label: 3'

Index.is_monotonic_increasing and Index.is_monotonic_decreasing only check that an index is weakly monotonic. To check for strict monotonicity, you can combine one of those with the is_unique() attribute.

In [221]: weakly_monotonic = pd.Index(["a", "b", "c", "c"])

In [222]: weakly_monotonic
Out[222]: Index(['a', 'b', 'c', 'c'], dtype='object')

In [223]: weakly_monotonic.is_monotonic_increasing
Out[223]: True

In [224]: weakly_monotonic.is_monotonic_increasing & weakly_monotonic.is_unique
Out[224]: False

Endpoints are inclusive#

Compared with standard Python sequence slicing, in which the slice endpoint is not inclusive, label-based slicing in pandas is inclusive. The primary reason for this is that it is often not possible to easily determine the "successor" or next element after a particular label in an index. For example, consider the following Series:

In [225]: s = pd.Series(np.random.randn(6), index=list("abcdef"))

In [226]: s
Out[226]: 
a   -0.101684
b   -0.734907
c   -0.130121
d   -0.476046
e    0.759104
f    0.213379
dtype: float64

Suppose we wished to slice from c to e; using integers, this would be accomplished as such:

In [227]: s[2:5]
Out[227]: 
c   -0.130121
d   -0.476046
e    0.759104
dtype: float64

However, if you only had c and e, determining the next element in the index can be somewhat complicated. For example, the following does not work:

In [228]: s.loc['c':'e' + 1]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[228], line 1
----> 1 s.loc['c':'e' + 1]

TypeError: can only concatenate str (not "int") to str

A very common use case is to limit a time series to start and end at two specific dates. To enable this, we made the design choice to make label-based slicing include both endpoints:

In [229]: s.loc["c":"e"]
Out[229]: 
c   -0.130121
d   -0.476046
e    0.759104
dtype: float64

This is most definitely a "practicality beats purity" sort of thing, but it is something to watch out for if you expect label-based slicing to behave exactly in the way that standard Python integer slicing works.
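If you genuinely need end-exclusive behavior on a sorted label index, one workaround (a sketch, not an official idiom) is to convert the labels to positions with searchsorted() and then slice positionally:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(6), index=list("abcdef"))

# Half-open [c, e): include 'c', exclude 'e'
start = s.index.searchsorted("c")
stop = s.index.searchsorted("e")
half_open = s.iloc[start:stop]  # rows 'c' and 'd'
```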

Indexing potentially changes underlying Series dtype#

The different indexing operations can potentially change the dtype of a Series.

In [230]: series1 = pd.Series([1, 2, 3])

In [231]: series1.dtype
Out[231]: dtype('int64')

In [232]: res = series1.reindex([0, 4])

In [233]: res.dtype
Out[233]: dtype('float64')

In [234]: res
Out[234]: 
0    1.0
4    NaN
dtype: float64

In [235]: series2 = pd.Series([True])

In [236]: series2.dtype
Out[236]: dtype('bool')

In [237]: res = series2.reindex_like(series1)

In [238]: res.dtype
Out[238]: dtype('O')

In [239]: res
Out[239]: 
0    True
1     NaN
2     NaN
dtype: object

This is because the (re)indexing operations above silently insert NaNs, and the dtype changes accordingly. This can cause some issues when using numpy ufuncs such as numpy.logical_and.
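One way to avoid the silent cast is to use pandas' nullable extension dtypes, which can hold missing values without changing kind. A sketch using the nullable "Int64" and "boolean" dtypes:

```python
import pandas as pd

series1 = pd.Series([1, 2, 3], dtype="Int64")   # nullable integer
res = series1.reindex([0, 4])                   # dtype stays Int64; missing is pd.NA

series2 = pd.Series([True], dtype="boolean")    # nullable boolean
res2 = series2.reindex([0, 1, 2])               # dtype stays boolean
```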

See GH 2388 for a more detailed discussion.