Working with text data#

Text data types#

There are two ways to store text data in pandas:

object-dtype NumPy array.
StringDtype extension type.

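We recommend using StringDtype to store text data.

Prior to pandas 1.0, object dtype was the only option. This was unfortunate for several reasons:

You can accidentally store a mixture of strings and non-strings in an object dtype array. It is better to have a dedicated dtype.
object dtype breaks dtype-specific operations like DataFrame.select_dtypes(). There is no clear way to select just text while excluding non-text but still object-dtype columns.
When reading code, the contents of an object dtype array are less clear than 'string'.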
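
As a small illustration of the first point, the sketch below uses `pandas.api.types.infer_dtype` to show that an object-dtype Series accepts a stray integer next to strings, whereas requesting the `"string"` dtype converts the integer to `str`. The variable names are illustrative only.

```python
import pandas as pd
from pandas.api.types import infer_dtype

# An object-dtype Series accepts any Python object, so an integer can sit
# next to the strings without any complaint.
mixed = pd.Series(["a", "b", 3], dtype="object")
print(infer_dtype(mixed))  # reports a mixed inferred type

# With the dedicated dtype, the integer is converted to the string "3".
text = pd.Series(["a", "b", 3], dtype="string")
print([type(v).__name__ for v in text])
```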

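Currently, the performance of object dtype arrays of strings and arrays.StringArray is about the same. We expect future enhancements to significantly increase the performance and lower the memory overhead of StringArray.

Warning

StringArray is currently considered experimental. The implementation and parts of the API may change without warning.

For backwards compatibility, object dtype remains the default type we infer a list of strings to.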

In [1]: pd.Series(["a", "b", "c"])
Out[1]: 
0    a
1    b
2    c
dtype: object

To explicitly request string dtype, specify the dtype:

In [2]: pd.Series(["a", "b", "c"], dtype="string")
Out[2]: 
0    a
1    b
2    c
dtype: string

In [3]: pd.Series(["a", "b", "c"], dtype=pd.StringDtype())
Out[3]: 
0    a
1    b
2    c
dtype: string

Or use astype after the Series or DataFrame is created:

In [4]: s = pd.Series(["a", "b", "c"])

In [5]: s
Out[5]: 
0    a
1    b
2    c
dtype: object

In [6]: s.astype("string")
Out[6]: 
0    a
1    b
2    c
dtype: string

You can also use StringDtype/"string" as the dtype on non-string data, and it will be converted to string dtype:

In [7]: s = pd.Series(["a", 2, np.nan], dtype="string")

In [8]: s
Out[8]: 
0       a
1       2
2    <NA>
dtype: string

In [9]: type(s[1])
Out[9]: str

or convert from existing pandas data:

In [10]: s1 = pd.Series([1, 2, np.nan], dtype="Int64")

In [11]: s1
Out[11]: 
0       1
1       2
2    <NA>
dtype: Int64

In [12]: s2 = s1.astype("string")

In [13]: s2
Out[13]: 
0       1
1       2
2    <NA>
dtype: string

In [14]: type(s2[0])
Out[14]: str

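Behavior differences#

These are places where the behavior of StringDtype objects differs from object dtype.

For StringDtype, string accessor methods that return numeric output will always return a nullable integer dtype, rather than either int or float dtype depending on the presence of NA values. Methods returning boolean output will return a nullable boolean dtype.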

In [15]: s = pd.Series(["a", None, "b"], dtype="string")

In [16]: s
Out[16]: 
0       a
1    <NA>
2       b
dtype: string

In [17]: s.str.count("a")
Out[17]: 
0       1
1    <NA>
2       0
dtype: Int64

In [18]: s.dropna().str.count("a")
Out[18]: 
0    1
2    0
dtype: Int64

Both outputs are Int64 dtype. Compare that with object dtype:

In [19]: s2 = pd.Series(["a", None, "b"], dtype="object")

In [20]: s2.str.count("a")
Out[20]: 
0    1.0
1    NaN
2    0.0
dtype: float64

In [21]: s2.dropna().str.count("a")
Out[21]: 
0    1
2    0
dtype: int64

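When NA values are present, the output dtype is float64. The same holds for methods returning boolean values, which return the nullable boolean dtype for StringDtype: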

In [22]: s.str.isdigit()
Out[22]: 
0    False
1     <NA>
2    False
dtype: boolean

In [23]: s.str.match("a")
Out[23]: 
0     True
1     <NA>
2    False
dtype: boolean

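Some string methods, like Series.str.decode(), are not available on StringArray because StringArray only holds strings, not bytes.
In comparison operations, arrays.StringArray and Series backed by a StringArray will return an object with BooleanDtype, rather than a bool dtype object. Missing values in a StringArray will propagate in comparison operations, rather than always comparing unequal like numpy.nan.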
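
The comparison behavior described above can be sketched as follows. This is a minimal illustration with made-up variable names; it compares the same values stored once as "string" and once as object dtype.

```python
import pandas as pd

values = ["a", None, "b"]
s_string = pd.Series(values, dtype="string")
s_object = pd.Series(values, dtype="object")

# With StringDtype the missing value propagates into the result,
# which uses the nullable boolean dtype.
eq_string = s_string == "a"

# With object dtype the missing value simply compares unequal,
# and the result is a plain NumPy bool Series.
eq_object = s_object == "a"

print(eq_string.dtype, eq_string.isna().tolist())
print(eq_object.dtype, eq_object.tolist())
```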

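Everything else that follows in the rest of this document applies equally to string and object dtype.

String methods#

Series and Index are equipped with a set of string processing methods that make it easy to operate on each element of the array. Perhaps most importantly, these methods exclude missing/NA values automatically. They are accessed via the str attribute and generally have names matching the equivalent (scalar) built-in string methods: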

In [24]: s = pd.Series(
   ....:     ["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"], dtype="string"
   ....: )
   ....: 

In [25]: s.str.lower()
Out[25]: 
0       a
1       b
2       c
3    aaba
4    baca
5    <NA>
6    caba
7     dog
8     cat
dtype: string

In [26]: s.str.upper()
Out[26]: 
0       A
1       B
2       C
3    AABA
4    BACA
5    <NA>
6    CABA
7     DOG
8     CAT
dtype: string

In [27]: s.str.len()
Out[27]: 
0       1
1       1
2       1
3       4
4       4
5    <NA>
6       4
7       3
8       3
dtype: Int64

In [28]: idx = pd.Index([" jack", "jill ", " jesse ", "frank"])

In [29]: idx.str.strip()
Out[29]: Index(['jack', 'jill', 'jesse', 'frank'], dtype='object')

In [30]: idx.str.lstrip()
Out[30]: Index(['jack', 'jill ', 'jesse ', 'frank'], dtype='object')

In [31]: idx.str.rstrip()
Out[31]: Index([' jack', 'jill', ' jesse', 'frank'], dtype='object')

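The string methods on Index are especially useful for cleaning up or transforming DataFrame columns. For instance, you may have columns with leading or trailing whitespace: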

In [32]: df = pd.DataFrame(
   ....:     np.random.randn(3, 2), columns=[" Column A ", " Column B "], index=range(3)
   ....: )
   ....: 

In [33]: df
Out[33]: 
   Column A   Column B 
0   0.469112  -0.282863
1  -1.509059  -1.135632
2   1.212112  -0.173215

Since df.columns is an Index object, we can use the .str accessor:

In [34]: df.columns.str.strip()
Out[34]: Index(['Column A', 'Column B'], dtype='object')

In [35]: df.columns.str.lower()
Out[35]: Index([' column a ', ' column b '], dtype='object')

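These string methods can then be used to clean up the columns as needed. Here we remove leading and trailing whitespace, lowercase all names, and replace any remaining whitespace with underscores: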

In [36]: df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

In [37]: df
Out[37]: 
   column_a  column_b
0  0.469112 -0.282863
1 -1.509059 -1.135632
2  1.212112 -0.173215

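Note

If you have a Series where many elements are repeated (i.e. the number of unique elements in the Series is much smaller than the length of the Series), it can be faster to convert the original Series to one of type category and then use .str.<method> or .dt.<property> on that. The performance difference comes from the fact that, for a Series of type category, the string operations are done on the .categories and not on each element of the Series.

Please note that a Series of type category has some limitations in comparison to a Series of type string (e.g. you can't add strings to each other: s + " " + s won't work if s is a Series of type category). Also, .str methods which operate on elements of type list are not available on such a Series.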
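
A minimal sketch of the conversion described in this note, using made-up data. It only checks that the category route gives the same values; it makes no claim about timings, which depend on the data and the pandas version.

```python
import pandas as pd

# A long Series with only three distinct values.
s = pd.Series(["apple", "banana", "cherry"] * 1000, dtype="string")

# Convert once to category; .str methods then work on the three
# categories and the result is mapped back onto all 3000 rows.
s_cat = s.astype("category")
upper_from_cat = s_cat.str.upper()

# The values agree with calling the method on the original Series.
print(upper_from_cat.tolist() == s.str.upper().tolist())
```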

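Warning

The type of the Series is inferred and must be one of the allowed types (i.e. strings).

Generally speaking, the .str accessor is intended to work only on strings. With very few exceptions, other uses are not supported and may be disabled at a later point.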
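
For instance, accessing .str on a numeric Series fails outright. The short sketch below (with illustrative names) catches the resulting AttributeError:

```python
import pandas as pd

numbers = pd.Series([1, 2, 3])

message = None
try:
    numbers.str.upper()
except AttributeError as err:
    # pandas refuses to create the accessor for non-string data.
    message = str(err)

print(message)
```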

Splitting and replacing strings#

Methods like split return a Series of lists:

In [38]: s2 = pd.Series(["a_b_c", "c_d_e", np.nan, "f_g_h"], dtype="string")

In [39]: s2.str.split("_")
Out[39]: 
0    [a, b, c]
1    [c, d, e]
2         <NA>
3    [f, g, h]
dtype: object

Elements in the split lists can be accessed using get or [] notation:

In [40]: s2.str.split("_").str.get(1)
Out[40]: 
0       b
1       d
2    <NA>
3       g
dtype: object

In [41]: s2.str.split("_").str[1]
Out[41]: 
0       b
1       d
2    <NA>
3       g
dtype: object

It is easy to expand this to return a DataFrame using expand.

In [42]: s2.str.split("_", expand=True)
Out[42]: 
      0     1     2
0     a     b     c
1     c     d     e
2  <NA>  <NA>  <NA>
3     f     g     h

When the original Series has StringDtype, the output columns will all be StringDtype as well.

It is also possible to limit the number of splits:

In [43]: s2.str.split("_", expand=True, n=1)
Out[43]: 
      0     1
0     a   b_c
1     c   d_e
2  <NA>  <NA>
3     f   g_h

rsplit is similar to split except that it works in the reverse direction, i.e. from the end of the string to the beginning of the string:

In [44]: s2.str.rsplit("_", expand=True, n=1)
Out[44]: 
      0     1
0   a_b     c
1   c_d     e
2  <NA>  <NA>
3   f_g     h

replace optionally uses regular expressions:

In [45]: s3 = pd.Series(
   ....:     ["A", "B", "C", "Aaba", "Baca", "", np.nan, "CABA", "dog", "cat"],
   ....:     dtype="string",
   ....: )
   ....: 

In [46]: s3
Out[46]: 
0       A
1       B
2       C
3    Aaba
4    Baca
5        
6    <NA>
7    CABA
8     dog
9     cat
dtype: string

In [47]: s3.str.replace("^.a|dog", "XX-XX ", case=False, regex=True)
Out[47]: 
0           A
1           B
2           C
3    XX-XX ba
4    XX-XX ca
5            
6        <NA>
7    XX-XX BA
8      XX-XX 
9     XX-XX t
dtype: string

Changed in version 2.0.

A single-character pattern with regex=True will also be treated as a regular expression:

In [48]: s4 = pd.Series(["a.b", ".", "b", np.nan, ""], dtype="string")

In [49]: s4
Out[49]: 
0     a.b
1       .
2       b
3    <NA>
4        
dtype: string

In [50]: s4.str.replace(".", "a", regex=True)
Out[50]: 
0     aaa
1       a
2       a
3    <NA>
4        
dtype: string

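If you want literal replacement of a string (equivalent to str.replace()), you can set the optional regex parameter to False, rather than escaping each character. In this case both pat and repl must be strings: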

In [51]: dollars = pd.Series(["12", "-$10", "$10,000"], dtype="string")

# These lines are equivalent
In [52]: dollars.str.replace(r"-\$", "-", regex=True)
Out[52]: 
0         12
1        -10
2    $10,000
dtype: string

In [53]: dollars.str.replace("-$", "-", regex=False)
Out[53]: 
0         12
1        -10
2    $10,000
dtype: string

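The replace method can also take a callable as the replacement. It is called on every match of pat using re.sub(). The callable should expect one positional argument (a regex match object) and return a string.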

# Reverse every lowercase alphabetic word
In [54]: pat = r"[a-z]+"

In [55]: def repl(m):
   ....:     return m.group(0)[::-1]
   ....: 

In [56]: pd.Series(["foo 123", "bar baz", np.nan], dtype="string").str.replace(
   ....:     pat, repl, regex=True
   ....: )
   ....: 
Out[56]: 
0    oof 123
1    rab zab
2       <NA>
dtype: string

# Using regex groups
In [57]: pat = r"(?P<one>\w+) (?P<two>\w+) (?P<three>\w+)"

In [58]: def repl(m):
   ....:     return m.group("two").swapcase()
   ....: 

In [59]: pd.Series(["Foo Bar Baz", np.nan], dtype="string").str.replace(
   ....:     pat, repl, regex=True
   ....: )
   ....: 
Out[59]: 
0     bAR
1    <NA>
dtype: string

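The replace method also accepts a compiled regular expression object from re.compile() as a pattern. All flags should be included in the compiled regular expression object.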

In [60]: import re

In [61]: regex_pat = re.compile(r"^.a|dog", flags=re.IGNORECASE)

In [62]: s3.str.replace(regex_pat, "XX-XX ", regex=True)
Out[62]: 
0           A
1           B
2           C
3    XX-XX ba
4    XX-XX ca
5            
6        <NA>
7    XX-XX BA
8      XX-XX 
9     XX-XX t
dtype: string

Including a flags argument when calling replace with a compiled regular expression object will raise a ValueError.

In [63]: s3.str.replace(regex_pat, 'XX-XX ', flags=re.IGNORECASE, regex=True)
---------------------------------------------------------------------------
ValueError: case and flags cannot be set when pat is a compiled regex

removeprefix and removesuffix have the same effect as str.removeprefix and str.removesuffix, which were added in Python 3.9 (https://docs.python.org/3/library/stdtypes.html#str.removeprefix):

Added in version 1.4.0.

In [64]: s = pd.Series(["str_foo", "str_bar", "no_prefix"])

In [65]: s.str.removeprefix("str_")
Out[65]: 
0          foo
1          bar
2    no_prefix
dtype: object

In [66]: s = pd.Series(["foo_str", "bar_str", "no_suffix"])

In [67]: s.str.removesuffix("_str")
Out[67]: 
0          foo
1          bar
2    no_suffix
dtype: object

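Concatenation#

There are several ways to concatenate a Series or Index, either with itself or with others, all based on cat() (or Index.str.cat for an Index).

Concatenating a single Series into a string#

The content of a Series (or Index) can be concatenated: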

In [68]: s = pd.Series(["a", "b", "c", "d"], dtype="string")

In [69]: s.str.cat(sep=",")
Out[69]: 'a,b,c,d'

If not specified, the keyword sep for the separator defaults to the empty string, sep='':

In [70]: s.str.cat()
Out[70]: 'abcd'

By default, missing values are ignored. Using na_rep, they can be given a representation:

In [71]: t = pd.Series(["a", "b", np.nan, "d"], dtype="string")

In [72]: t.str.cat(sep=",")
Out[72]: 'a,b,d'

In [73]: t.str.cat(sep=",", na_rep="-")
Out[73]: 'a,b,-,d'

Concatenating a Series and something list-like into a Series#

The first argument to cat() can be a list-like object, provided that it matches the length of the calling Series (or Index).

In [74]: s.str.cat(["A", "B", "C", "D"])
Out[74]: 
0    aA
1    bB
2    cC
3    dD
dtype: string

Missing values on either side will result in missing values in the result as well, unless na_rep is specified:

In [75]: s.str.cat(t)
Out[75]: 
0      aa
1      bb
2    <NA>
3      dd
dtype: string

In [76]: s.str.cat(t, na_rep="-")
Out[76]: 
0    aa
1    bb
2    c-
3    dd
dtype: string

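Concatenating a Series and something array-like into a Series#

The parameter others can also be two-dimensional. In this case, the number of rows must match the length of the calling Series (or Index).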

In [77]: d = pd.concat([t, s], axis=1)

In [78]: s
Out[78]: 
0    a
1    b
2    c
3    d
dtype: string

In [79]: d
Out[79]: 
      0  1
0     a  a
1     b  b
2  <NA>  c
3     d  d

In [80]: s.str.cat(d, na_rep="-")
Out[80]: 
0    aaa
1    bbb
2    c-c
3    ddd
dtype: string

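Concatenating a Series and an indexed object into a Series, with alignment#

For concatenation with a Series or DataFrame, it is possible to align the indexes before concatenation by setting the join keyword.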

In [81]: u = pd.Series(["b", "d", "a", "c"], index=[1, 3, 0, 2], dtype="string")

In [82]: s
Out[82]: 
0    a
1    b
2    c
3    d
dtype: string

In [83]: u
Out[83]: 
1    b
3    d
0    a
2    c
dtype: string

In [84]: s.str.cat(u)
Out[84]: 
0    aa
1    bb
2    cc
3    dd
dtype: string

In [85]: s.str.cat(u, join="left")
Out[85]: 
0    aa
1    bb
2    cc
3    dd
dtype: string

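The usual options are available for join (one of 'left', 'outer', 'inner', 'right'). In particular, alignment also means that the lengths of the objects no longer need to match.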

In [86]: v = pd.Series(["z", "a", "b", "d", "e"], index=[-1, 0, 1, 3, 4], dtype="string")

In [87]: s
Out[87]: 
0    a
1    b
2    c
3    d
dtype: string

In [88]: v
Out[88]: 
-1    z
 0    a
 1    b
 3    d
 4    e
dtype: string

In [89]: s.str.cat(v, join="left", na_rep="-")
Out[89]: 
0    aa
1    bb
2    c-
3    dd
dtype: string

In [90]: s.str.cat(v, join="outer", na_rep="-")
Out[90]: 
-1    -z
 0    aa
 1    bb
 2    c-
 3    dd
 4    -e
dtype: string
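
The two remaining join options can be sketched the same way. The block below redefines s and v so that it is self-contained; the names inner and right are illustrative.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "d"], dtype="string")
v = pd.Series(["z", "a", "b", "d", "e"], index=[-1, 0, 1, 3, 4], dtype="string")

# join="inner" keeps only the labels present in both indexes: 0, 1 and 3.
inner = s.str.cat(v, join="inner", na_rep="-")

# join="right" uses the index of v, so labels -1 and 4 have no partner
# in s and are filled with na_rep on the left-hand side.
right = s.str.cat(v, join="right", na_rep="-")

print(inner.tolist())  # ['aa', 'bb', 'dd']
print(right.tolist())  # ['-z', 'aa', 'bb', 'dd', '-e']
```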

The same alignment can be used when others is a DataFrame:

In [91]: f = d.loc[[3, 2, 1, 0], :]

In [92]: s
Out[92]: 
0    a
1    b
2    c
3    d
dtype: string

In [93]: f
Out[93]: 
      0  1
3     d  d
2  <NA>  c
1     b  b
0     a  a

In [94]: s.str.cat(f, join="left", na_rep="-")
Out[94]: 
0    aaa
1    bbb
2    c-c
3    ddd
dtype: string

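Concatenating a Series and many objects into a Series#

Several array-like items (specifically: Series, Index, and 1-dimensional variants of np.ndarray) can be combined in a list-like container (including iterators, dict-views, etc.).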

In [95]: s
Out[95]: 
0    a
1    b
2    c
3    d
dtype: string

In [96]: u
Out[96]: 
1    b
3    d
0    a
2    c
dtype: string

In [97]: s.str.cat([u, u.to_numpy()], join="left")
Out[97]: 
0    aab
1    bbd
2    cca
3    ddc
dtype: string

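All elements without an index (e.g. np.ndarray) within the passed list-like must match in length to the calling Series (or Index), but Series and Index may have arbitrary length (as long as alignment is not disabled with join=None):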

In [98]: v
Out[98]: 
-1    z
 0    a
 1    b
 3    d
 4    e
dtype: string

In [99]: s.str.cat([v, u, u.to_numpy()], join="outer", na_rep="-")
Out[99]: 
-1    -z--
0     aaab
1     bbbd
2     c-ca
3     dddc
4     -e--
dtype: string

If using join='right' on a list of others that contains different indexes, the union of these indexes will be used as the basis for the final concatenation:

In [100]: u.loc[[3]]
Out[100]: 
3    d
dtype: string

In [101]: v.loc[[-1, 0]]
Out[101]: 
-1    z
 0    a
dtype: string

In [102]: s.str.cat([u.loc[[3]], v.loc[[-1, 0]]], join="right", na_rep="-")
Out[102]: 
 3    dd-
-1    --z
 0    a-a
dtype: string

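Indexing with `.str`#

You can use [] notation to index directly by position. If you index past the end of the string, the result will be a missing value (NaN for object dtype, <NA> for StringDtype as in the output below).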

In [103]: s = pd.Series(
   .....:     ["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"], dtype="string"
   .....: )
   .....: 

In [104]: s.str[0]
Out[104]: 
0       A
1       B
2       C
3       A
4       B
5    <NA>
6       C
7       d
8       c
dtype: string

In [105]: s.str[1]
Out[105]: 
0    <NA>
1    <NA>
2    <NA>
3       a
4       a
5    <NA>
6       A
7       o
8       a
dtype: string

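Extracting substrings#

Extract first match in each subject (extract)#

The extract method accepts a regular expression with at least one capture group.

Extracting a regular expression with more than one group returns a DataFrame with one column per group.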

In [106]: pd.Series(
   .....:     ["a1", "b2", "c3"],
   .....:     dtype="string",
   .....: ).str.extract(r"([ab])(\d)", expand=False)
   .....: 
Out[106]: 
      0     1
0     a     1
1     b     2
2  <NA>  <NA>

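Elements that do not match return a row filled with missing values. Thus, a Series of messy strings can be "converted" into a like-indexed Series or DataFrame of cleaned-up or more useful strings, without needing get() to access tuples or re.match objects. With object-dtype input the dtype of the result is object, even if no match is found and the result only contains NaN; with StringDtype input, as in the examples here, missing entries are shown as <NA>.

Named groups like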

In [107]: pd.Series(["a1", "b2", "c3"], dtype="string").str.extract(
   .....:     r"(?P<letter>[ab])(?P<digit>\d)", expand=False
   .....: )
   .....: 
Out[107]: 
  letter digit
0      a     1
1      b     2
2   <NA>  <NA>

and optional groups like

In [108]: pd.Series(
   .....:     ["a1", "b2", "3"],
   .....:     dtype="string",
   .....: ).str.extract(r"([ab])?(\d)", expand=False)
   .....: 
Out[108]: 
      0  1
0     a  1
1     b  2
2  <NA>  3

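can also be used. Note that any capture group names in the regular expression will be used for column names; otherwise capture group numbers will be used.

Extracting a regular expression with one group returns a DataFrame with one column if expand=True.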

In [109]: pd.Series(["a1", "b2", "c3"], dtype="string").str.extract(r"[ab](\d)", expand=True)
Out[109]: 
      0
0     1
1     2
2  <NA>

It returns a Series if expand=False.

In [110]: pd.Series(["a1", "b2", "c3"], dtype="string").str.extract(r"[ab](\d)", expand=False)
Out[110]: 
0       1
1       2
2    <NA>
dtype: string

Calling extract on an Index with a regex that has exactly one capture group returns a DataFrame with one column if expand=True.

In [111]: s = pd.Series(["a1", "b2", "c3"], ["A11", "B22", "C33"], dtype="string")

In [112]: s
Out[112]: 
A11    a1
B22    b2
C33    c3
dtype: string

In [113]: s.index.str.extract("(?P<letter>[a-zA-Z])", expand=True)
Out[113]: 
  letter
0      A
1      B
2      C

It returns an Index if expand=False.

In [114]: s.index.str.extract("(?P<letter>[a-zA-Z])", expand=False)
Out[114]: Index(['A', 'B', 'C'], dtype='object', name='letter')

Calling extract on an Index with a regex that has more than one capture group returns a DataFrame if expand=True.

In [115]: s.index.str.extract("(?P<letter>[a-zA-Z])([0-9]+)", expand=True)
Out[115]: 
  letter   1
0      A  11
1      B  22
2      C  33

It raises ValueError if expand=False.

In [116]: s.index.str.extract("(?P<letter>[a-zA-Z])([0-9]+)", expand=False)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[116], line 1
----> 1 s.index.str.extract("(?P<letter>[a-zA-Z])([0-9]+)", expand=False)

File /home/pandas/pandas/core/strings/accessor.py:137, in forbid_nonstring_types.<locals>._forbid_nonstring_types.<locals>.wrapper(self, *args, **kwargs)
    132     msg = (
    133         f"Cannot use .str.{func_name} with values of "
    134         f"inferred dtype '{self._inferred_dtype}'."
    135     )
    136     raise TypeError(msg)
--> 137 return func(self, *args, **kwargs)

File /home/pandas/pandas/core/strings/accessor.py:2909, in StringMethods.extract(self, pat, flags, expand)
   2906     raise ValueError("pattern contains no capture groups")
   2908 if not expand and regex.groups > 1 and isinstance(self._data, ABCIndex):
-> 2909     raise ValueError("only one regex group is supported with Index")
   2911 obj = self._data
   2912 result_dtype = _result_dtype(obj)

ValueError: only one regex group is supported with Index

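The table below summarizes the behavior of extract(expand=False) (input subject in the first column, number of groups in the regex in the first row)

	1 group	>1 group
Index	Index	ValueError
Series	Series	DataFrame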

Extract all matches in each subject (extractall)#

Unlike extract (which returns only the first match),

In [117]: s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"], dtype="string")

In [118]: s
Out[118]: 
A    a1a2
B      b1
C      c1
dtype: string

In [119]: two_groups = "(?P<letter>[a-z])(?P<digit>[0-9])"

In [120]: s.str.extract(two_groups, expand=True)
Out[120]: 
  letter digit
A      a     1
B      b     1
C      c     1

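the extractall method returns every match. The result of extractall is always a DataFrame with a MultiIndex on its rows. The last level of the MultiIndex is named match and indicates the order in the subject.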

In [121]: s.str.extractall(two_groups)
Out[121]: 
        letter digit
  match             
A 0          a     1
  1          a     2
B 0          b     1
C 0          c     1

When each subject string in the Series has exactly one match,

In [122]: s = pd.Series(["a3", "b3", "c2"], dtype="string")

In [123]: s
Out[123]: 
0    a3
1    b3
2    c2
dtype: string

then extractall(pat).xs(0, level='match') gives the same result as extract(pat).

In [124]: extract_result = s.str.extract(two_groups, expand=True)

In [125]: extract_result
Out[125]: 
  letter digit
0      a     3
1      b     3
2      c     2

In [126]: extractall_result = s.str.extractall(two_groups)

In [127]: extractall_result
Out[127]: 
        letter digit
  match             
0 0          a     3
1 0          b     3
2 0          c     2

In [128]: extractall_result.xs(0, level="match")
Out[128]: 
  letter digit
0      a     3
1      b     3
2      c     2

Index also supports .str.extractall. It returns a DataFrame that is the same as the result of Series.str.extractall on a Series with a default index (starting from 0).

In [129]: pd.Index(["a1a2", "b1", "c1"]).str.extractall(two_groups)
Out[129]: 
        letter digit
  match             
0 0          a     1
  1          a     2
1 0          b     1
2 0          c     1

In [130]: pd.Series(["a1a2", "b1", "c1"], dtype="string").str.extractall(two_groups)
Out[130]: 
        letter digit
  match             
0 0          a     1
  1          a     2
1 0          b     1
2 0          c     1

Testing for strings that match or contain a pattern#

You can check whether elements contain a pattern:

In [131]: pattern = r"[0-9][a-z]"

In [132]: pd.Series(
   .....:     ["1", "2", "3a", "3b", "03c", "4dx"],
   .....:     dtype="string",
   .....: ).str.contains(pattern)
   .....: 
Out[132]: 
0    False
1    False
2     True
3     True
4     True
5     True
dtype: boolean

Or whether elements match a pattern:

In [133]: pd.Series(
   .....:     ["1", "2", "3a", "3b", "03c", "4dx"],
   .....:     dtype="string",
   .....: ).str.match(pattern)
   .....: 
Out[133]: 
0    False
1    False
2     True
3     True
4    False
5     True
dtype: boolean

In [134]: pd.Series(
   .....:     ["1", "2", "3a", "3b", "03c", "4dx"],
   .....:     dtype="string",
   .....: ).str.fullmatch(pattern)
   .....: 
Out[134]: 
0    False
1    False
2     True
3     True
4    False
5    False
dtype: boolean

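Note

The distinction between match, fullmatch, and contains is strictness: fullmatch tests whether the entire string matches the regular expression; match tests whether there is a match of the regular expression that begins at the first character of the string; and contains tests whether there is a match of the regular expression at any position within the string.

The corresponding functions in the re package for these three match modes are re.fullmatch, re.match, and re.search, respectively.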
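
The correspondence can be checked with the standard library alone. The sketch below applies the three re functions to the same subjects and pattern used in the examples above; the list names are illustrative.

```python
import re

pattern = r"[0-9][a-z]"
subjects = ["1", "2", "3a", "3b", "03c", "4dx"]

# re.search looks anywhere in the string (like .str.contains).
contains = [re.search(pattern, x) is not None for x in subjects]

# re.match must start at the first character (like .str.match).
match = [re.match(pattern, x) is not None for x in subjects]

# re.fullmatch must cover the whole string (like .str.fullmatch).
fullmatch = [re.fullmatch(pattern, x) is not None for x in subjects]

print(contains)   # [False, False, True, True, True, True]
print(match)      # [False, False, True, True, False, True]
print(fullmatch)  # [False, False, True, True, False, False]
```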

Methods like match, fullmatch, contains, startswith, and endswith take an extra na argument so that missing values can be considered True or False:

In [135]: s4 = pd.Series(
   .....:     ["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"], dtype="string"
   .....: )
   .....: 

In [136]: s4.str.contains("A", na=False)
Out[136]: 
0     True
1    False
2    False
3     True
4    False
5    False
6     True
7    False
8    False
dtype: boolean

Creating indicator variables#

You can extract dummy variables from string columns. For example, if they are separated by a '|':

In [137]: s = pd.Series(["a", "a|b", np.nan, "a|c"], dtype="string")

In [138]: s.str.get_dummies(sep="|")
Out[138]: 
   a  b  c
0  1  0  0
1  1  1  0
2  0  0  0
3  1  0  1

A string Index also supports get_dummies, which returns a MultiIndex.

In [139]: idx = pd.Index(["a", "a|b", np.nan, "a|c"])

In [140]: idx.str.get_dummies(sep="|")
Out[140]: 
MultiIndex([(1, 0, 0),
            (1, 1, 0),
            (0, 0, 0),
            (1, 0, 1)],
           names=['a', 'b', 'c'])

See also get_dummies().
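
As a rough comparison, assuming the reference above is to the top-level pandas.get_dummies function: that function treats each whole value as one category, while Series.str.get_dummies first splits on the separator. The variable names are illustrative.

```python
import pandas as pd

s = pd.Series(["a", "a|b"], dtype="object")

# Top-level get_dummies: one indicator column per distinct whole value.
whole = pd.get_dummies(s)

# .str.get_dummies: split on "|" first, one column per distinct piece.
split = s.str.get_dummies(sep="|")

print(list(whole.columns))  # ['a', 'a|b']
print(list(split.columns))  # ['a', 'b']
```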

Method summary#

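Method	Description
`cat()`	Concatenate strings
`split()`	Split strings on delimiter
`rsplit()`	Split strings on delimiter working from the end of the string
`get()`	Index into each element (retrieve i-th element)
`join()`	Join strings in each element of the Series with passed separator
`get_dummies()`	Split strings on the delimiter returning DataFrame of dummy variables
`contains()`	Return boolean array if each string contains pattern/regex
`replace()`	Replace occurrences of pattern/regex/string with some other string or the return value of a callable given the occurrence
`removeprefix()`	Remove prefix from string, i.e. only remove if string starts with prefix.
`removesuffix()`	Remove suffix from string, i.e. only remove if string ends with suffix.
`repeat()`	Duplicate values (`s.str.repeat(3)` equivalent to `x * 3`)
`pad()`	Add whitespace to the left, right, or both sides of strings
`center()`	Equivalent to `str.center`
`ljust()`	Equivalent to `str.ljust`
`rjust()`	Equivalent to `str.rjust`
`zfill()`	Equivalent to `str.zfill`
`wrap()`	Split long strings into lines with length less than a given width
`slice()`	Slice each string in the Series
`slice_replace()`	Replace slice in each string with passed value
`count()`	Count occurrences of pattern
`startswith()`	Equivalent to `str.startswith(pat)` for each element
`endswith()`	Equivalent to `str.endswith(pat)` for each element
`findall()`	Compute list of all occurrences of pattern/regex for each string
`match()`	Call `re.match` on each element, returning a boolean
`extract()`	Call `re.search` on each element, returning DataFrame with one row for each element and one column for each regex capture group
`extractall()`	Call `re.findall` on each element, returning DataFrame with one row for each match and one column for each regex capture group
`len()`	Compute string lengths
`strip()`	Equivalent to `str.strip`
`rstrip()`	Equivalent to `str.rstrip`
`lstrip()`	Equivalent to `str.lstrip`
`partition()`	Equivalent to `str.partition`
`rpartition()`	Equivalent to `str.rpartition`
`lower()`	Equivalent to `str.lower`
`casefold()`	Equivalent to `str.casefold`
`upper()`	Equivalent to `str.upper`
`find()`	Equivalent to `str.find`
`rfind()`	Equivalent to `str.rfind`
`index()`	Equivalent to `str.index`
`rindex()`	Equivalent to `str.rindex`
`capitalize()`	Equivalent to `str.capitalize`
`swapcase()`	Equivalent to `str.swapcase`
`normalize()`	Return Unicode normal form. Equivalent to `unicodedata.normalize`
`translate()`	Equivalent to `str.translate`
`isalnum()`	Equivalent to `str.isalnum`
`isalpha()`	Equivalent to `str.isalpha`
`isdigit()`	Equivalent to `str.isdigit`
`isspace()`	Equivalent to `str.isspace`
`islower()`	Equivalent to `str.islower`
`isupper()`	Equivalent to `str.isupper`
`istitle()`	Equivalent to `str.istitle`
`isnumeric()`	Equivalent to `str.isnumeric`
`isdecimal()`	Equivalent to `str.isdecimal`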
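
A few of the methods in the table are not demonstrated earlier on this page. The short sketch below, with made-up data and variable names, shows zfill, pad, slice, and repeat on a small string Series.

```python
import pandas as pd

codes = pd.Series(["7", "42", "123"], dtype="string")

# Left-fill with zeros up to a width of 4.
zero_filled = codes.str.zfill(4)

# Pad on the left with "*" up to a width of 5.
left_padded = codes.str.pad(5, side="left", fillchar="*")

# Keep the first two characters of each string.
first_two = codes.str.slice(0, 2)

# Repeat each string twice.
doubled = codes.str.repeat(2)

print(zero_filled.tolist())  # ['0007', '0042', '0123']
print(left_padded.tolist())  # ['****7', '***42', '**123']
print(first_two.tolist())    # ['7', '42', '12']
print(doubled.tolist())      # ['77', '4242', '123123']
```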