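Version 0.10.1 (January 22, 2013)

This is a minor release from 0.10.0 and includes new features, enhancements, and bug fixes. In particular, there is substantial new HDFStore functionality contributed by Jeff Reback.

An undesired API breakage with functions taking the inplace option has been reverted and deprecation warnings added.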

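API changes

Functions taking an inplace option return the calling object as before. A deprecation message has been added.
Groupby aggregations max/min no longer exclude non-numeric data (GH 2700)
Resampling an empty DataFrame now returns an empty DataFrame instead of raising an exception (GH 2640)
The file reader now raises an exception when NA values are found in an explicitly specified integer column, instead of converting the column to float (GH 2631)
DatetimeIndex.unique now returns a DatetimeIndex with the same name and timezone instead of an array (GH 2563)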
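
A few of the API changes above can be illustrated with a minimal sketch. The frame, column names, and CSV text here are made up for illustration, and the behaviour shown is that of a recent pandas release rather than 0.10.1 itself.

```python
import io

import pandas as pd

# Groupby max/min keep non-numeric columns such as strings (GH 2700).
df = pd.DataFrame(
    {"key": ["a", "a", "b"], "label": ["x", "z", "y"], "value": [1, 3, 2]}
)
group_max = df.groupby("key").max()

# DatetimeIndex.unique keeps the index type, name, and timezone (GH 2563).
idx = pd.DatetimeIndex(
    ["2013-01-01", "2013-01-01", "2013-01-02"], tz="UTC", name="when"
)
unique_idx = idx.unique()

# A missing value in a column explicitly requested as integer raises
# instead of silently converting the column to float (GH 2631).
csv_text = "a,b\n1,2\n,3\n"
try:
    pd.read_csv(io.StringIO(csv_text), dtype={"a": "int64"})
    int_na_error = None
except ValueError as exc:
    int_na_error = exc
```

Here `group_max` still contains the string column `label`, `unique_idx` is a `DatetimeIndex` named `when`, and `int_na_error` holds the exception raised by the reader.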

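New features

MySQL database support (contributed by Dan Allan)

HDFStore

You may need to upgrade your existing data files. Please visit the compatibility section in the main docs.

You can designate (and index) certain columns that you want to be able to query on a table by passing a list to data_columns.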

In [1]: store = pd.HDFStore("store.h5")

In [2]: df = pd.DataFrame(
   ...:     np.random.randn(8, 3),
   ...:     index=pd.date_range("1/1/2000", periods=8),
   ...:     columns=["A", "B", "C"],
   ...: )
   ...: 

In [3]: df["string"] = "foo"

In [4]: df.loc[df.index[4:6], "string"] = np.nan

In [5]: df.loc[df.index[7:9], "string"] = "bar"

In [6]: df["string2"] = "cool"

In [7]: df
Out[7]: 
                   A         B         C string string2
2000-01-01  0.469112 -0.282863 -1.509059    foo    cool
2000-01-02 -1.135632  1.212112 -0.173215    foo    cool
2000-01-03  0.119209 -1.044236 -0.861849    foo    cool
2000-01-04 -2.104569 -0.494929  1.071804    foo    cool
2000-01-05  0.721555 -0.706771 -1.039575    NaN    cool
2000-01-06  0.271860 -0.424972  0.567020    NaN    cool
2000-01-07  0.276232 -1.087401 -0.673690    foo    cool
2000-01-08  0.113648 -1.478427  0.524988    bar    cool

# on-disk operations
In [8]: store.append("df", df, data_columns=["B", "C", "string", "string2"])

In [9]: store.select("df", "B>0 and string=='foo'")
Out[9]: 
                   A         B         C string string2
2000-01-02 -1.135632  1.212112 -0.173215    foo    cool

# this is in-memory version of this type of selection
In [10]: df[(df.B > 0) & (df.string == "foo")]
Out[10]: 
                   A         B         C string string2
2000-01-02 -1.135632  1.212112 -0.173215    foo    cool

Retrieving unique values in an indexable or data column.

# note that this is deprecated as of 0.14.0
# can be replicated by: store.select_column('df','index').unique()
store.unique("df", "index")
store.unique("df", "string")
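
Since `store.unique` is deprecated, the replacement mentioned in the comment above can be sketched as follows. This assumes PyTables (the `tables` package) is installed; the file name and the small frame are hypothetical and exist only for this illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "A": np.arange(6, dtype="float64"),
        "string": ["foo", "foo", "bar", "foo", "bar", "bar"],
    },
    index=pd.date_range("2000-01-01", periods=6),
)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "unique_demo.h5")  # hypothetical file name
    with pd.HDFStore(path, mode="w") as store:
        # "string" must be a data column to be selectable on its own.
        store.append("df", df, data_columns=["string"])
        # select_column returns a Series; call unique() on the result.
        unique_index = store.select_column("df", "index").unique()
        unique_strings = store.select_column("df", "string").unique()
```

Only the index and columns declared in `data_columns` can be read this way, because `select_column` works on a single indexable or data column.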

You can now store datetime64 in data columns.

In [11]: df_mixed = df.copy()

In [12]: df_mixed["datetime64"] = pd.Timestamp("20010102")

In [13]: df_mixed.loc[df_mixed.index[3:4], ["A", "B"]] = np.nan

In [14]: store.append("df_mixed", df_mixed)

In [15]: df_mixed1 = store.select("df_mixed")

In [16]: df_mixed1
Out[16]: 
                   A         B         C string string2 datetime64
2000-01-01  0.469112 -0.282863 -1.509059    foo    cool 2001-01-02
2000-01-02 -1.135632  1.212112 -0.173215    foo    cool 2001-01-02
2000-01-03  0.119209 -1.044236 -0.861849    foo    cool 2001-01-02
2000-01-04       NaN       NaN  1.071804    foo    cool 2001-01-02
2000-01-05  0.721555 -0.706771 -1.039575    NaN    cool 2001-01-02
2000-01-06  0.271860 -0.424972  0.567020    NaN    cool 2001-01-02
2000-01-07  0.276232 -1.087401 -0.673690    foo    cool 2001-01-02
2000-01-08  0.113648 -1.478427  0.524988    bar    cool 2001-01-02

In [17]: df_mixed1.dtypes.value_counts()
Out[17]: 
float64          3
object           2
datetime64[s]    1
Name: count, dtype: int64

You can pass the columns keyword to select to filter the list of returned columns. This is equivalent to passing a Term('columns', list_of_columns_to_filter).

In [18]: store.select("df", columns=["A", "B"])
Out[18]: 
                   A         B
2000-01-01  0.469112 -0.282863
2000-01-02 -1.135632  1.212112
2000-01-03  0.119209 -1.044236
2000-01-04 -2.104569 -0.494929
2000-01-05  0.721555 -0.706771
2000-01-06  0.271860 -0.424972
2000-01-07  0.276232 -1.087401
2000-01-08  0.113648 -1.478427

HDFStore now serializes MultiIndex dataframes when appending tables.

In [19]: index = pd.MultiIndex(levels=[['foo', 'bar', 'baz', 'qux'],
   ....:                               ['one', 'two', 'three']],
   ....:                       codes=[[0, 0, 0, 1, 1, 2, 2, 3, 3, 3],
   ....:                              [0, 1, 2, 0, 1, 1, 2, 0, 1, 2]],
   ....:                       names=['foo', 'bar'])
   ....:

In [20]: df = pd.DataFrame(np.random.randn(10, 3), index=index,
   ....:                   columns=['A', 'B', 'C'])
   ....:

In [21]: df
Out[21]:
                  A         B         C
foo bar
foo one   -0.116619  0.295575 -1.047704
    two    1.640556  1.905836  2.772115
    three  0.088787 -1.144197 -0.633372
bar one    0.925372 -0.006438 -0.820408
    two   -0.600874 -1.039266  0.824758
baz two   -0.824095 -0.337730 -0.927764
    three -0.840123  0.248505 -0.109250
qux one    0.431977 -0.460710  0.336505
    two   -3.207595 -1.535854  0.409769
    three -0.673145 -0.741113 -0.110891

In [22]: store.append('mi', df)

In [23]: store.select('mi')
Out[23]:
                  A         B         C
foo bar
foo one   -0.116619  0.295575 -1.047704
    two    1.640556  1.905836  2.772115
    three  0.088787 -1.144197 -0.633372
bar one    0.925372 -0.006438 -0.820408
    two   -0.600874 -1.039266  0.824758
baz two   -0.824095 -0.337730 -0.927764
    three -0.840123  0.248505 -0.109250
qux one    0.431977 -0.460710  0.336505
    two   -3.207595 -1.535854  0.409769
    three -0.673145 -0.741113 -0.110891

# the levels are automatically included as data columns
In [24]: store.select('mi', "foo='bar'")
Out[24]:
                A         B         C
foo bar
bar one  0.925372 -0.006438 -0.820408
    two -0.600874 -1.039266  0.824758

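Multi-table creation via append_to_multiple and selection via select_as_multiple can create or select from multiple tables and return a combined result, by using where on a selector table.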

In [19]: df_mt = pd.DataFrame(
   ....:     np.random.randn(8, 6),
   ....:     index=pd.date_range("1/1/2000", periods=8),
   ....:     columns=["A", "B", "C", "D", "E", "F"],
   ....: )
   ....: 

In [20]: df_mt["foo"] = "bar"

# you can also create the tables individually
In [21]: store.append_to_multiple(
   ....:     {"df1_mt": ["A", "B"], "df2_mt": None}, df_mt, selector="df1_mt"
   ....: )
   ....: 

In [22]: store
Out[22]: 
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5

# individual tables were created
In [23]: store.select("df1_mt")
Out[23]: 
                   A         B
2000-01-01  0.404705  0.577046
2000-01-02 -1.344312  0.844885
2000-01-03  0.357021 -0.674600
2000-01-04  0.276662 -0.472035
2000-01-05  0.895717  0.805244
2000-01-06 -1.170299 -0.226169
2000-01-07 -0.076467 -1.187678
2000-01-08  1.024180  0.569605

In [24]: store.select("df2_mt")
Out[24]: 
                   C         D         E         F  foo
2000-01-01 -1.715002 -1.039268 -0.370647 -1.157892  bar
2000-01-02  1.075770 -0.109050  1.643563 -1.469388  bar
2000-01-03 -1.776904 -0.968914 -1.294524  0.413738  bar
2000-01-04 -0.013960 -0.362543 -0.006154 -0.923061  bar
2000-01-05 -1.206412  2.565646  1.431256  1.340309  bar
2000-01-06  0.410835  0.813850  0.132003 -0.827317  bar
2000-01-07  1.130127 -1.436737 -1.413681  1.607920  bar
2000-01-08  0.875906 -2.211372  0.974466 -2.006747  bar

# as a multiple
In [25]: store.select_as_multiple(
   ....:     ["df1_mt", "df2_mt"], where=["A>0", "B>0"], selector="df1_mt"
   ....: )
   ....: 
Out[25]: 
                   A         B         C         D         E         F  foo
2000-01-01  0.404705  0.577046 -1.715002 -1.039268 -0.370647 -1.157892  bar
2000-01-05  0.895717  0.805244 -1.206412  2.565646  1.431256  1.340309  bar
2000-01-08  1.024180  0.569605  0.875906 -2.211372  0.974466 -2.006747  bar

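Enhancements

HDFStore can now read native PyTables table format tables
You can pass nan_rep='my_nan_rep' to append to change the default NaN representation on disk (which converts to/from np.nan); this defaults to nan.
You can pass index to append. This defaults to True. It automatically creates indices on the indexables and data columns of the table.
You can pass chunksize=<an integer> to append to change the writing chunk size (default is 50000). This significantly lowers memory usage when writing.
You can pass expectedrows=<an integer> to the first append to set the total number of rows that PyTables should expect. This optimizes read/write performance.
Select now supports passing start and stop to limit the range of rows searched by the selection.
Greatly improved ISO8601 (e.g., yyyy-mm-dd) date parsing for file parsers (GH 2698)
Allow DataFrame.merge to handle combinatorial sizes too large for a 64-bit integer (GH 2690)
Series now has unary negation (-series) and inversion (~series) operators (GH 2686)
DataFrame.plot now includes a logx parameter to change the x-axis to log scale (GH 2327)
Series arithmetic operators can now handle constant and ndarray input (GH 2574)
ExcelFile now takes a kind argument to specify the file type (GH 2613)
A faster implementation for Series.str methods (GH 2602)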
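
As a rough illustration of the Series operator and date parsing enhancements listed above, the following sketch uses small made-up inputs. It reflects current pandas behaviour, not necessarily the exact output of 0.10.1.

```python
import io

import numpy as np
import pandas as pd

s = pd.Series([1, -2, 3])
mask = pd.Series([True, False, True])

# Unary negation and inversion (GH 2686).
negated = -s
inverted = ~mask

# Arithmetic with a constant and with an ndarray of matching length (GH 2574).
plus_const = s + 10
plus_array = s + np.array([10, 20, 30])

# ISO8601 dates (yyyy-mm-dd) parsed by the file reader (GH 2698).
csv_text = "date,value\n2013-01-22,1\n2013-01-23,2\n"
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])
```

The ndarray operand is combined positionally, so it needs the same length as the Series.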

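Bug fixes

HDFStore tables can now store float32 types correctly (they cannot be mixed with float64, however)
Fixed Google Analytics prefix when specifying a request segment (GH 2713).
Added a function to reset the Google Analytics token store so users can recover from improperly set up client secrets (GH 2687).
Fixed groupby bug resulting in a segfault when passing in a MultiIndex (GH 2706)
Fixed bug where passing a Series with datetime64 values into to_datetime resulted in bogus output values (GH 2699)
Fixed bug in pattern in HDFStore expressions when pattern is not a valid regex (GH 2694)
Fixed performance issues while aggregating boolean data (GH 2692)
When given a boolean mask key and a Series of new values, Series __setitem__ now aligns the incoming values with the original Series (GH 2686)
Fixed MemoryError caused by performing counting sort when sorting MultiIndex levels with a very large number of combinatorial values (GH 2684)
Fixed bug that caused plotting to fail when the index is a DatetimeIndex with a fixed-offset timezone (GH 2683)
Corrected business day subtraction logic when the offset is more than 5 business days and the starting date is on a weekend (GH 2680)
Fixed C file parser behavior when the file has more columns than data (GH 2668)
Fixed file reader bug that misaligned columns with data in the presence of an implicit column and a specified usecols value
DataFrames with numerical or datetime indices are now sorted prior to plotting (GH 2609)
Fixed DataFrame.from_records error when passed columns and index but empty records (GH 2633)
Fixed several bugs in Series operations when dtype is datetime64 (GH 2689, GH 2629, GH 2626)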
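
Three of the fixes above lend themselves to a short check. This is a sketch with made-up values, run against current pandas, so it shows the corrected behaviour rather than reproducing the original bugs.

```python
import pandas as pd
from pandas.tseries.offsets import BDay

# Boolean-mask assignment aligns the new values on the index (GH 2686).
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"])
new_values = pd.Series([40.0, 30.0], index=["d", "c"])  # deliberately unordered
s[s > 2] = new_values

# Subtracting more than 5 business days from a weekend date (GH 2680).
saturday = pd.Timestamp("2013-01-19")
six_bdays_earlier = saturday - BDay(6)

# to_datetime leaves a datetime64 Series unchanged (GH 2699).
dates = pd.Series(pd.date_range("2013-01-01", periods=3))
converted = pd.to_datetime(dates)
```

After the assignment, `c` holds 30.0 and `d` holds 40.0 because the values are matched by label, not by position. The subtraction counts Friday 2013-01-18 as the first business day back and lands on Friday 2013-01-11.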

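See the full release notes or the issue tracker on GitHub for a complete list.

Contributors

A total of 17 people contributed patches to this release. People with a "+" by their names contributed a patch for the first time.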

Andy Hayden +
Anton I. Sipos +
Chang She
Christopher Whelan
Damien Garaud +
Dan Allan +
Dieter Vandenbussche
Garrett Drapala +
Jay Parlar +
Thouis (Ray) Jones +
Vincent Arel-Bundock +
Wes McKinney
elpres
herrfz +
jreback
svaksha +
y-p