standardize: A function to standardize columns in a 2D NumPy array

A function that performs column-based standardization on a NumPy array.

> from mlxtend.preprocessing import standardize

Overview

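The result of standardization (or Z-score normalization) is that the features are rescaled so that they have the properties of a standard normal distribution with

$\mu = 0$ and $\sigma = 1$,

where $\mu$ is the mean (average) and $\sigma$ is the standard deviation from the mean. The standard score of a sample (also called its z-score) is calculated as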

$$z=\frac{x-\mu}{\sigma}.$$
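As a minimal sketch of the formula above, the z-scores of a single feature can be computed directly with NumPy. The variable names are illustrative, and the population standard deviation (divisor N) is assumed here, which corresponds to ddof=0.

```python
import numpy as np

# A single feature with six samples (illustrative values).
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)

mu = x.mean()          # mean of the feature
sigma = x.std(ddof=0)  # population standard deviation (divisor N)

z = (x - mu) / sigma   # z-score of every sample

print(np.round(z, 5))
```

After this transformation the feature has a mean of 0 and a standard deviation of 1; only its location and scale change, not the shape of its distribution.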

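Standardizing features so that they are centered around 0 with a standard deviation of 1 is important not only when we compare measurements that have different units; it is also a general requirement for many machine learning algorithms to perform well.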

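One family of algorithms that is scale-invariant is tree-based learning algorithms. Take the general CART decision tree algorithm as an example. Without going into depth about information gain and impurity measures, we can think of each decision as "is feature x_i >= some_val?" Intuitively, it does not matter what scale this feature is on (centimeters, Fahrenheit, or a standardized scale).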

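Some examples of algorithms where feature scaling matters are:

k-nearest neighbors with a Euclidean distance measure, if we want all features to contribute equally to the result
k-means (see k-nearest neighbors)
logistic regression, SVMs, perceptrons, neural networks, etc., if you are using gradient descent/ascent-based optimization; otherwise some weights will update much faster than others
linear discriminant analysis, principal component analysis, kernel principal component analysis, since you want to find the directions of maximum variance (under the constraint that those directions/eigenvectors/principal components are orthogonal); you want the features on the same scale, since you would otherwise give more weight to variables on "larger measurement scales".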
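To illustrate the k-nearest neighbors point, here is a small sketch with made-up data (two hypothetical features, height in meters and weight in grams). It shows how the feature with the larger numeric range dominates the Euclidean distance, and how standardizing the columns rebalances it. The data and the helper name `euclidean` are assumptions for illustration only.

```python
import numpy as np

# Hypothetical data: column 0 is height in meters, column 1 is weight in grams.
X = np.array([[1.60, 60000.0],
              [1.90, 60500.0],
              [1.61, 75000.0]])


def euclidean(a, b):
    """Euclidean distance between two 1D vectors."""
    return np.sqrt(np.sum((a - b) ** 2))


# Raw distances from sample 0: the weight column decides almost everything.
d_raw_01 = euclidean(X[0], X[1])
d_raw_02 = euclidean(X[0], X[2])

# Column-wise standardization (population standard deviation, ddof=0).
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Distances from sample 0 after standardization.
d_std_01 = euclidean(X_std[0], X_std[1])
d_std_02 = euclidean(X_std[0], X_std[2])

print(d_raw_01, d_raw_02)
print(d_std_01, d_std_02)
```

On the raw data, sample 2 appears about thirty times farther from sample 0 than sample 1 does, purely because weight is recorded in grams. In this constructed example the two standardized distances come out essentially equal, so the large height difference of sample 1 now counts as much as the large weight difference of sample 2.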

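There are many more cases than I can possibly list here ... I always recommend thinking about how the algorithm works; it then usually becomes clear whether or not we want to scale the features.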

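In addition, we also need to consider whether we want to "standardize" or "normalize" (here: scale to the [0, 1] range) our data. Some algorithms assume that our data is centered at 0. For example, if we initialize the weights of a small multilayer perceptron to 0 or to small random values centered around zero, we want the model weights to be updated "equally", which centered inputs support. As a rule of thumb, I would say: when in doubt, standardize the data; it is usually beneficial.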
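The difference between the two rescaling approaches can be sketched with plain NumPy as follows. The min-max formula used here is the common definition of [0, 1] normalization and the variable names are illustrative.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# Standardization: center at 0 and scale to unit standard deviation.
x_std = (x - x.mean()) / x.std()

# Min-max normalization: rescale linearly so that min -> 0 and max -> 1.
x_norm = (x - x.min()) / (x.max() - x.min())

print("standardized:", np.round(x_std, 3))
print("normalized:  ", np.round(x_norm, 3))
```

The standardized values are centered at 0, whereas the normalized values are bounded by 0 and 1 but centered at the midpoint of the original range (0.5 for this evenly spaced example).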

Example 1 - Standardize a Pandas DataFrame

import pandas as pd

s1 = pd.Series([1, 2, 3, 4, 5, 6], index=(range(6)))
s2 = pd.Series([10, 9, 8, 7, 6, 5], index=(range(6)))
df = pd.DataFrame(s1, columns=['s1'])
df['s2'] = s2
df

	s1	s2
0	1	10
1	2	9
2	3	8
3	4	7
4	5	6
5	6	5

from mlxtend.preprocessing import standardize
standardize(df, columns=['s1', 's2'])

	s1	s2
0	-1.46385	1.46385
1	-0.87831	0.87831
2	-0.29277	0.29277
3	0.29277	-0.29277
4	0.87831	-0.87831
5	1.46385	-1.46385

Example 2 - Standardize a NumPy Array

import numpy as np

X = np.array([[1, 10], [2, 9], [3, 8], [4, 7], [5, 6], [6, 5]])
X

array([[ 1, 10],
       [ 2,  9],
       [ 3,  8],
       [ 4,  7],
       [ 5,  6],
       [ 6,  5]])

from mlxtend.preprocessing import standardize
standardize(X, columns=[0, 1])

array([[-1.46385011,  1.46385011],
       [-0.87831007,  0.87831007],
       [-0.29277002,  0.29277002],
       [ 0.29277002, -0.29277002],
       [ 0.87831007, -0.87831007],
       [ 1.46385011, -1.46385011]])

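Example 3 - Re-using parameters

In a machine learning context, it is desirable to re-use the parameters obtained from a training set to scale new, future data (including the independent test set). By setting return_params=True, the standardize function returns a second object, a parameter dictionary containing the column means and standard deviations, which can be re-used by passing it to the params parameter in a later function call.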

import numpy as np
from mlxtend.preprocessing import standardize

X_train = np.array([[1, 10], [4, 7], [3, 8]])
X_test = np.array([[1, 2], [3, 4], [5, 6]])

X_train_std, params = standardize(X_train, 
                                  columns=[0, 1], 
                                  return_params=True)
X_train_std

array([[-1.33630621,  1.33630621],
       [ 1.06904497, -1.06904497],
       [ 0.26726124, -0.26726124]])

params

{'avgs': array([ 2.66666667,  8.33333333]),
 'stds': array([ 1.24721913,  1.24721913])}

X_test_std = standardize(X_test, 
                         columns=[0, 1], 
                         params=params)
X_test_std

array([[-1.33630621, -5.0779636 ],
       [ 0.26726124, -3.47439614],
       [ 1.87082869, -1.87082869]])
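The test-set values above can be reproduced by hand, which makes explicit that the test data is scaled with the training statistics rather than its own. This is a NumPy-only sketch that assumes the default ddof=0; the names `avgs` and `stds` simply mirror the keys of the parameter dictionary.

```python
import numpy as np

X_train = np.array([[1, 10], [4, 7], [3, 8]])
X_test = np.array([[1, 2], [3, 4], [5, 6]])

# Column means and population standard deviations of the TRAINING set.
avgs = X_train.mean(axis=0)
stds = X_train.std(axis=0, ddof=0)

# Apply the training-set parameters to the test set.
X_test_std = (X_test - avgs) / stds

print(X_test_std)
```

Because the second test column lies far below the training mean of that column, its standardized values are strongly negative; this is expected and is exactly the information that would be lost if the test set were standardized on its own.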

API

standardize(array, columns=None, ddof=0, return_params=False, params=None)

Standardize columns in pandas DataFrames or NumPy arrays.

Parameters

array : pandas DataFrame or NumPy ndarray, shape = [n_rows, n_columns].
columns : array-like, shape = [n_columns] (default: None)

Array-like with column names, e.g., ['col1', 'col2', ...], or column indices, e.g., [0, 2, 4, ...]. If None, standardizes all columns.
ddof : int (default: 0)

Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.
return_params : bool (default: False)

If set to True, a dictionary is returned in addition to the standardized array. The parameter dictionary contains the column means ('avgs') and standard deviations ('stds') of the individual columns.
params : dict (default: None)

A dictionary with column means and standard deviations as returned by the standardize function if return_params was set to True. If a params dictionary is provided, the standardize function will use these instead of computing them from the current array.
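To make the effect of ddof concrete, the following sketch compares the two most common choices using NumPy's own np.std, which follows the same N - ddof convention. It does not call standardize itself.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# ddof=0: divide the sum of squared deviations by N (population estimate).
std_population = np.std(x, ddof=0)

# ddof=1: divide by N - 1 (sample estimate with Bessel's correction).
std_sample = np.std(x, ddof=1)

print(std_population, std_sample)
```

The ddof=1 value is always slightly larger, so z-scores computed with it are slightly smaller in magnitude. Note that pandas computes standard deviations with ddof=1 by default, which is a common source of small mismatches when comparing results.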

Notes

If all values in a given column are the same, these values are all set to 0.0. The standard deviation in the parameters dictionary is consequently set to 1.0 to avoid dividing by zero.
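The behavior described in this note can be mimicked in a few lines of NumPy. This is only a sketch of the described behavior, not mlxtend's actual source code, and the variable names are assumptions.

```python
import numpy as np

# The second column is constant, so its standard deviation is 0.
X = np.array([[1.0, 5.0],
              [2.0, 5.0],
              [3.0, 5.0]])

avgs = X.mean(axis=0)
stds = X.std(axis=0)

# Replace zero standard deviations with 1.0 to avoid dividing by zero.
stds[stds == 0.0] = 1.0

X_std = (X - avgs) / stds

print(X_std)
print(stds)
```

Since every value in a constant column equals the column mean, the numerator is already 0, and dividing by 1.0 leaves the column at 0.0.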

Returns

df_new : pandas DataFrame or NumPy ndarray.

Copy of the array or DataFrame with standardized columns.

Examples

For usage examples, please see https://rasbt.github.io/mlxtend/user_guide/preprocessing/standardize/