Version 0.13.1 (February 3, 2014)#

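This is a minor release from 0.13.0 and includes a small number of API changes, several new features, enhancements, and performance improvements, along with a large number of bug fixes. We recommend that all users upgrade to this version.

Highlights include:

Added the infer_datetime_format keyword to read_csv/to_datetime to allow speedups for homogeneously formatted datetimes.
Display precision for datetime/timedelta formats is now limited intelligently.
Enhanced Panel apply() method.
Suggested tutorials in the new Tutorials section.
Our pandas ecosystem is growing; we now feature related projects in a new Ecosystem page section.
Much work has gone into improving the docs, and a new Contributing section has been added.
Even though it may only be of interest to devs, we <3 our new CI status page: ScatterCI.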

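Warning

0.13.1 fixes a bug caused by the combination of numpy < 1.8 and chained assignment on a string-like array. Chained indexing can have unexpected results and should generally be avoided.

This would previously segfault: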

df = pd.DataFrame({"A": np.array(["foo", "bar", "bah", "foo", "bar"])})
df["A"].iloc[0] = np.nan

The recommended way to do this type of assignment is:

In [1]: df = pd.DataFrame({"A": np.array(["foo", "bar", "bah", "foo", "bar"])})

In [2]: df.loc[0, "A"] = np.nan

In [3]: df
Out[3]: 
     A
0  NaN
1  bar
2  bah
3  foo
4  bar

Output formatting enhancements#

The df.info() view now displays dtype info per column (GH 5682)

df.info() now honors the max_info_rows option, to disable null counts for large frames (GH 5974)

In [4]: max_info_rows = pd.get_option("max_info_rows")

In [5]: df = pd.DataFrame(
   ...:     {
   ...:         "A": np.random.randn(10),
   ...:         "B": np.random.randn(10),
   ...:         "C": pd.date_range("20130101", periods=10),
   ...:     }
   ...: )
   ...: 

In [6]: df.iloc[3:6, [0, 2]] = np.nan

# set to not display the null counts
In [7]: pd.set_option("max_info_rows", 0)

In [8]: df.info()
<class 'pandas.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
 #   Column  Dtype         
---  ------  -----         
 0   A       float64       
 1   B       float64       
 2   C       datetime64[ns]
dtypes: datetime64[ns](1), float64(2)
memory usage: 368.0 bytes

# this is the default (same as in 0.13.0)
In [9]: pd.set_option("max_info_rows", max_info_rows)

In [10]: df.info()
<class 'pandas.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   A       7 non-null      float64       
 1   B       10 non-null     float64       
 2   C       7 non-null      datetime64[ns]
dtypes: datetime64[ns](1), float64(2)
memory usage: 368.0 bytes

Added a show_dimensions display option for the new DataFrame repr to control whether the dimensions are printed.

In [11]: df = pd.DataFrame([[1, 2], [3, 4]])

In [12]: pd.set_option("show_dimensions", False)

In [13]: df
Out[13]: 
   0  1
0  1  2
1  3  4

In [14]: pd.set_option("show_dimensions", True)

In [15]: df
Out[15]: 
   0  1
0  1  2
1  3  4

[2 rows x 2 columns]

The ArrayFormatter for datetime and timedelta64 now intelligently limits precision based on the values in the array (GH 3401)

Previously, output might look like:

  age                 today               diff
0 2001-01-01 00:00:00 2013-04-19 00:00:00 4491 days, 00:00:00
1 2004-06-01 00:00:00 2013-04-19 00:00:00 3244 days, 00:00:00

Now the output looks like:

In [16]: df = pd.DataFrame(
   ....:     [pd.Timestamp("20010101"), pd.Timestamp("20040601")], columns=["age"]
   ....: )
   ....: 

In [17]: df["today"] = pd.Timestamp("20130419")

In [18]: df["diff"] = df["today"] - df["age"]

In [19]: df
Out[19]: 
         age      today      diff
0 2001-01-01 2013-04-19 4491 days
1 2004-06-01 2013-04-19 3244 days

[2 rows x 3 columns]

API changes#

Added -NaN and -nan to the default set of NA values (GH 5952). See NA Values.
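
As a minimal sketch of what this means in practice, the snippet below parses a small in-memory CSV with read_csv and checks which entries come back as missing. The column name value and the sample data are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical sample data: two ordinary floats and the two NA spellings.
csv_text = "value\n1.5\n-NaN\n-nan\n2.5\n"

parsed = pd.read_csv(io.StringIO(csv_text))

# Both "-NaN" and "-nan" are treated as missing by default.
na_mask = parsed["value"].isna().tolist()
print(na_mask)  # [False, True, True, False]
```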

Added the Series.str.get_dummies vectorized string method (GH 6021), to extract dummy/indicator variables for separated string columns:

In [20]: s = pd.Series(["a", "a|b", np.nan, "a|c"])

In [21]: s.str.get_dummies(sep="|")
Out[21]: 
   a  b  c
0  1  0  0
1  1  1  0
2  0  0  0
3  1  0  1

[4 rows x 3 columns]

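Added the NDFrame.equals() method to compare whether two NDFrames are equal, that is, whether they have the same axes, dtypes, and values. Added the array_equivalent function to compare whether two ndarrays are equal. NaNs in identical locations are treated as equal. (GH 5283) See also the docs for a motivating example.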
```
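import numpy as np
import pandas as pd

df = pd.DataFrame({"col": ["foo", 0, np.nan]})
df2 = pd.DataFrame({"col": [np.nan, 0, "foo"]}, index=[2, 1, 0])
df.equals(df2)  # False: same values, but the index order differs
df.equals(df2.sort_index())  # True: after sorting, NaNs sit in the same location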
```
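
To illustrate why NaN handling and dtypes matter here, the following sketch contrasts an elementwise == comparison with equals(). The frame and variable names are illustrative only.

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({"x": [1.0, np.nan, 3.0]})
right = left.copy()

# Elementwise comparison: NaN != NaN, so not every cell compares equal.
elementwise_all_equal = bool((left == right).all().all())

# equals() treats NaNs in the same location as equal.
frames_equal = bool(left.equals(right))

# equals() also requires matching dtypes, not just matching values.
as_int = pd.DataFrame({"x": [1, 2]})
as_float = pd.DataFrame({"x": [1.0, 2.0]})
dtype_mismatch_equal = bool(as_int.equals(as_float))

print(elementwise_all_equal, frames_equal, dtype_mismatch_equal)
# False True False
```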

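DataFrame.apply will use the reduce argument to determine whether a Series or a DataFrame should be returned when the DataFrame is empty (GH 6007).

Previously, calling DataFrame.apply on an empty DataFrame would return either a DataFrame if there were no columns, or the function being applied would be called with an empty Series to guess whether a Series or DataFrame should be returned: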

In [32]: def applied_func(col):
  ....:    print("Apply function being called with: ", col)
  ....:    return col.sum()
  ....:

In [33]: empty = pd.DataFrame(columns=['a', 'b'])

In [34]: empty.apply(applied_func)
Apply function being called with:  Series([], Length: 0, dtype: float64)
Out[34]:
a   NaN
b   NaN
Length: 2, dtype: float64

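Now, when apply is called on an empty DataFrame: if the reduce argument is True, a Series will be returned; if it is False, a DataFrame will be returned; if it is None (the default), the function being applied will be called with an empty Series to try to guess the return type.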

In [35]: empty.apply(applied_func, reduce=True)
Out[35]:
a   NaN
b   NaN
Length: 2, dtype: float64

In [36]: empty.apply(applied_func, reduce=False)
Out[36]:
Empty DataFrame
Columns: [a, b]
Index: []

[0 rows x 2 columns]

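Prior version deprecations/changes#

There are no announced changes in 0.13 or prior that are taking effect as of 0.13.1.

Deprecations#

There are no deprecations of prior behavior in 0.13.1.

Enhancements#

pd.read_csv and pd.to_datetime learned a new infer_datetime_format keyword, which greatly improves parsing performance in many cases. Thanks to @lexual for suggesting it and @danbirken for rapidly implementing it. (GH 5490, GH 6021)

If parse_dates is enabled and this flag is set, pandas will attempt to infer the format of the datetime strings in the columns and, if the format can be inferred, switch to a faster method of parsing them. In some cases this can increase the parsing speed by ~5-10x.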
```
# Try to infer the format for the index column
df = pd.read_csv(
    "foo.csv", index_col=0, parse_dates=True, infer_datetime_format=True
)
```
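The date_format and datetime_format keywords can now be specified when writing to excel files (GH 4133)

MultiIndex.from_product, a convenience function for creating a MultiIndex from the cartesian product of a set of iterables (GH 6055):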

In [22]: shades = ["light", "dark"]

In [23]: colors = ["red", "green", "blue"]

In [24]: pd.MultiIndex.from_product([shades, colors], names=["shade", "color"])
Out[24]: 
MultiIndex([('light',   'red'),
            ('light', 'green'),
            ('light',  'blue'),
            ( 'dark',   'red'),
            ( 'dark', 'green'),
            ( 'dark',  'blue')],
           names=['shade', 'color'])
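
As a small sketch of what "cartesian product" means here, the snippet below checks that the entries of the resulting index match what itertools.product yields for the same inputs, then uses the index on a throwaway frame. The count column and the frame itself are hypothetical.

```python
import itertools

import pandas as pd

shades = ["light", "dark"]
colors = ["red", "green", "blue"]

index = pd.MultiIndex.from_product([shades, colors], names=["shade", "color"])

# The index entries appear in the same order as itertools.product yields them.
expected_pairs = list(itertools.product(shades, colors))
same_order = list(index) == expected_pairs

# One row per (shade, color) combination: 2 * 3 = 6 rows.
frame = pd.DataFrame({"count": range(len(index))}, index=index)

# Select every row whose "shade" level equals "dark".
dark_rows = frame.xs("dark", level="shade")
```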

Panel apply() will work on non-ufuncs. See the docs.

In [28]: import pandas._testing as tm

In [29]: panel = tm.makePanel(5)

In [30]: panel
Out[30]:
<class 'pandas.core.panel.Panel'>
Dimensions: 3 (items) x 5 (major_axis) x 4 (minor_axis)
Items axis: ItemA to ItemC
Major_axis axis: 2000-01-03 00:00:00 to 2000-01-07 00:00:00
Minor_axis axis: A to D

In [31]: panel['ItemA']
Out[31]:
                   A         B         C         D
2000-01-03 -0.673690  0.577046 -1.344312 -1.469388
2000-01-04  0.113648 -1.715002  0.844885  0.357021
2000-01-05 -1.478427 -1.039268  1.075770 -0.674600
2000-01-06  0.524988 -0.370647 -0.109050 -1.776904
2000-01-07  0.404705 -1.157892  1.643563 -0.968914

[5 rows x 4 columns]

Specifying an apply that operates on a Series (to return a single element)

In [32]: panel.apply(lambda x: x.dtype, axis='items')
Out[32]:
                  A        B        C        D
2000-01-03  float64  float64  float64  float64
2000-01-04  float64  float64  float64  float64
2000-01-05  float64  float64  float64  float64
2000-01-06  float64  float64  float64  float64
2000-01-07  float64  float64  float64  float64

[5 rows x 4 columns]

A similar reduction type operation

In [33]: panel.apply(lambda x: x.sum(), axis='major_axis')
Out[33]:
      ItemA     ItemB     ItemC
A -1.108775 -1.090118 -2.984435
B -3.705764  0.409204  1.866240
C  2.110856  2.960500 -0.974967
D -4.532785  0.303202 -3.685193

[4 rows x 3 columns]

This is equivalent to

In [34]: panel.sum('major_axis')
Out[34]:
      ItemA     ItemB     ItemC
A -1.108775 -1.090118 -2.984435
B -3.705764  0.409204  1.866240
C  2.110856  2.960500 -0.974967
D -4.532785  0.303202 -3.685193

[4 rows x 3 columns]

A transformation operation that returns a Panel, but computes the z-score across the major_axis

In [35]: result = panel.apply(lambda x: (x - x.mean()) / x.std(),
  ....:                      axis='major_axis')
  ....:

In [36]: result
Out[36]:
<class 'pandas.core.panel.Panel'>
Dimensions: 3 (items) x 5 (major_axis) x 4 (minor_axis)
Items axis: ItemA to ItemC
Major_axis axis: 2000-01-03 00:00:00 to 2000-01-07 00:00:00
Minor_axis axis: A to D

In [37]: result['ItemA']
Out[37]:
                  A         B         C         D
2000-01-03 -0.535778  1.500802 -1.506416 -0.681456
2000-01-04  0.397628 -1.108752  0.360481  1.529895
2000-01-05 -1.489811 -0.339412  0.557374  0.280845
2000-01-06  0.885279  0.421830 -0.453013 -1.053785
2000-01-07  0.742682 -0.474468  1.041575 -0.075499

[5 rows x 4 columns]
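
For readers who want to try the z-score transformation in isolation, here is a self-contained sketch on a single DataFrame, which stands in for one item of the panel. The random data, seed, and variable names are assumptions for illustration and will not reproduce the numbers shown above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for one panel item: 5 dates by 4 columns of random data.
rng = np.random.default_rng(0)
frame = pd.DataFrame(
    rng.standard_normal((5, 4)),
    index=pd.date_range("2000-01-03", periods=5),
    columns=list("ABCD"),
)

# Standardize each column along the date axis: subtract the column mean
# and divide by the column's sample standard deviation.
zscores = frame.apply(lambda col: (col - col.mean()) / col.std())

# After the transformation every column has mean ~0 and standard deviation ~1.
column_means = zscores.mean()
column_stds = zscores.std()
```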

Panel apply() operating on cross-sectional slabs. (GH 1148)

In [38]: def f(x):
   ....:     return ((x.T - x.mean(1)) / x.std(1)).T
   ....:

In [39]: result = panel.apply(f, axis=['items', 'major_axis'])

In [40]: result
Out[40]:
<class 'pandas.core.panel.Panel'>
Dimensions: 4 (items) x 5 (major_axis) x 3 (minor_axis)
Items axis: A to D
Major_axis axis: 2000-01-03 00:00:00 to 2000-01-07 00:00:00
Minor_axis axis: ItemA to ItemC

In [41]: result.loc[:, :, 'ItemA']
Out[41]:
                   A         B         C         D
2000-01-03  0.012922 -0.030874 -0.629546 -0.757034
2000-01-04  0.392053 -1.071665  0.163228  0.548188
2000-01-05 -1.093650 -0.640898  0.385734 -1.154310
2000-01-06  1.005446 -1.154593 -0.595615 -0.809185
2000-01-07  0.783051 -0.198053  0.919339 -1.052721

[5 rows x 4 columns]

This is equivalent to the following

In [42]: result = pd.Panel({ax: f(panel.loc[:, :, ax]) for ax in panel.minor_axis})

In [43]: result
Out[43]:
<class 'pandas.core.panel.Panel'>
Dimensions: 4 (items) x 5 (major_axis) x 3 (minor_axis)
Items axis: A to D
Major_axis axis: 2000-01-03 00:00:00 to 2000-01-07 00:00:00
Minor_axis axis: ItemA to ItemC

In [44]: result.loc[:, :, 'ItemA']
Out[44]:
                   A         B         C         D
2000-01-03  0.012922 -0.030874 -0.629546 -0.757034
2000-01-04  0.392053 -1.071665  0.163228  0.548188
2000-01-05 -1.093650 -0.640898  0.385734 -1.154310
2000-01-06  1.005446 -1.154593 -0.595615 -0.809185
2000-01-07  0.783051 -0.198053  0.919339 -1.052721

[5 rows x 4 columns]

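Performance#

Performance improvements for 0.13.1

Series datetime/timedelta binary operations (GH 5801)
DataFrame count/dropna for axis=1
Series.str.contains now has a regex=False keyword, which can be faster for plain (non-regex) string patterns. (GH 5879)
Series.str.extract (GH 5944)
dtypes/ftypes methods (GH 5968)
Indexing with object dtypes (GH 5968)
DataFrame.apply (GH 6013)
Regression in JSON IO (GH 5765)
Index construction from Series (GH 6150)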
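
Besides speed, regex=False also changes how the pattern is interpreted. The sketch below, using made-up sample strings, shows the difference between a literal match and a regular-expression match for the same pattern.

```python
import pandas as pd

s = pd.Series(["a.c", "abc", "a+c"])

# regex=False: the pattern is a plain substring, so "." only matches a literal dot.
literal_hits = s.str.contains("a.c", regex=False).tolist()

# regex=True: "." is a wildcard that matches any single character.
regex_hits = s.str.contains("a.c", regex=True).tolist()

print(literal_hits)  # [True, False, False]
print(regex_hits)    # [True, True, True]
```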

Experimental#

There are no experimental changes in 0.13.1.

Bug fixes#

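Bug in io.wb.get_countries not including all countries (GH 6008)
Bug in Series replace with a timestamp dict (GH 5797)
read_csv/read_table now respects the prefix kwarg (GH 5732).
Bug in selection with missing values via .ix from a duplicate-indexed DataFrame (GH 5835)
Fixed an issue with boolean comparison on empty DataFrames (GH 5808)
Bug in isnull handling NaT in an object array (GH 5443)
Bug in to_datetime when passed np.nan or an integer datelike and a format string (GH 5863)
Bug in groupby dtype conversion with datetimelike (GH 5869)
Regression in handling of an empty Series as an indexer (GH 5877)
Bug in internal caching, related to (GH 5727)
Testing bug in reading JSON/msgpack from a non-filepath on Windows under py3 (GH 5874)
Bug when assigning to .ix[tuple(…)] (GH 5896)
Bug in fully reindexing a Panel (GH 5905)
Bug in idxmin/max with object dtypes (GH 5914)
Bug in BusinessDay when adding n days to a date not on offset when n>5 and n%5==0 (GH 5890)
Bug in assigning to a chained series via ix (GH 5928)
Bug in creating an empty DataFrame, copying, then assigning (GH 5932)
Bug in DataFrame.tail with an empty frame (GH 5846)
Bug in propagating metadata on resample (GH 5862)
Fixed the string representation of NaT to be “NaT” (GH 5708)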
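Fixed the string representation of Timestamp to show nanoseconds if present (GH 5912)
pd.match not returning the passed sentinel
Panel.to_frame() no longer fails when major_axis is a MultiIndex (GH 5402).
Bug in pd.read_msgpack inferring a DateTimeIndex frequency incorrectly (GH 5947)
Fixed to_datetime for arrays containing both Tz-aware datetimes and NaT (GH 5961)
Bug in rolling skew/kurtosis when passed a Series with bad data (GH 5749)
Bug in scipy interpolate methods with a datetime index (GH 5975)
Bug in NaT comparison if a mixed datetime/np.datetime64 with NaT was passed (GH 5968)
Fixed a bug with pd.concat losing dtype information if all inputs are empty (GH 5742)
Recent changes in IPython caused warnings to be emitted when using previous versions of pandas in QTConsole; this is now fixed. If you are using an older version and need to suppress the warnings, see (GH 5922).
Bug in merging timedelta dtypes (GH 5695)
Bug in the plotting.scatter_matrix function: wrong alignment between diagonal and off-diagonal plots, see (GH 5497).
Regression in Series with a MultiIndex via ix (GH 6018)
Bug in Series.xs with a MultiIndex (GH 6018)
Bug in Series construction of mixed type with a datelike and an integer (which should result in object type and not automatic conversion) (GH 6028)
Possible segfault when chained indexing with an object array under NumPy 1.7.1 (GH 6026, GH 6056)
Bug in setting a single element with a non-scalar (e.g. a list) using fancy indexing (GH 6043)
to_sql did not respect if_exists (GH 4110 GH 4304)
Regression in .get(None) indexing from 0.12 (GH 5652)
Subtle iloc indexing bug, surfaced in (GH 6059)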
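Bug with insertion of strings into a DatetimeIndex (GH 5818)
Fixed a unicode bug in to_html/HTML repr (GH 6098)
Fixed missing argument validation in get_options_data (GH 6105)
Bug in assignment with duplicate columns in a frame where the locations are a slice (e.g. next to each other) (GH 6120)
Bug in propagating _ref_locs during construction of a DataFrame with duplicate index/columns (GH 6121)
Bug in DataFrame.apply when using mixed datelike reductions (GH 6125)
Bug in DataFrame.append when appending a row with different columns (GH 6129)
Bug in DataFrame construction with a recarray and a non-ns datetime dtype (GH 6140)
Bug in .loc setitem indexing with a DataFrame on the right-hand side, multiple item setting, and a datetimelike (GH 6152)
Fixed a bug in query/eval during lexicographic string comparisons (GH 6155).
Fixed a bug in query where the index of a single-element Series was being thrown away (GH 6148).
Bug in HDFStore when appending a DataFrame with MultiIndexed columns to an existing table (GH 6167)
Consistency with dtypes when setting an empty DataFrame (GH 6171)
Bug in selecting on a MultiIndex HDFStore even in the presence of an under-specified column spec (GH 6169)
Bug in nanops.var with ddof=1 and 1 element, which would sometimes return inf rather than nan on some platforms (GH 6136)
Bug in Series and DataFrame bar plots ignoring the use_index keyword (GH 6209)
Fixed a bug in groupby with mixed str/int under python3; argsort was failing (GH 6212)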
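
Two of the fixes above concern string representations (GH 5708, GH 5912). The short sketch below shows the fixed behavior as it can be checked on a current install; the timestamp value is arbitrary.

```python
import pandas as pd

# NaT renders as the literal text "NaT".
nat_text = str(pd.NaT)

# A Timestamp carrying a nanosecond component shows all nine fractional digits.
ts = pd.Timestamp("2014-02-03 12:00:00.000000001")
ts_text = str(ts)

print(nat_text)  # NaT
print(ts_text)   # 2014-02-03 12:00:00.000000001
```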

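Contributors#

A total of 52 people contributed patches to this release. People with a “+” by their names contributed a patch for the first time.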

Alex Rothberg
Alok Singhal +
Andrew Burrows +
Andy Hayden
Bjorn Arneson +
Brad Buran
Caleb Epstein
Chapman Siu
Chase Albert +
Clark Fitzgerald +
DSM
Dan Birken
Daniel Waeber +
David Wolever +
Doran Deluz +
Douglas McNeil +
Douglas Rudd +
Dražen Lučanin
Elliot S +
Felix Lawrence +
George Kuan +
Guillaume Gay +
Jacob Schaer
Jan Wagner +
Jeff Tratner
John McNamara
Joris Van den Bossche
Julia Evans +
Kieran O’Mahony
Michael Schatzow +
Naveen Michaud-Agrawal +
Patrick O’Keeffe +
Phillip Cloud
Roman Pekar
Skipper Seabold
Spencer Lyon
Tom Augspurger +
TomAugspurger
acorbe +
akittredge +
bmu +
bwignall +
chapman siu
danielballan
david +
davidshinn
immerrr +
jreback
lexual
mwaskom +
unutbu
y-p