Softmax Regression: A Multiclass Version of Logistic Regression

A logistic regression class for multi-class classification tasks.

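# Classification with Softmax Regression

In this section, we show how to use the softmax regression model to solve multi-class classification problems. First, we need to import the `SoftmaxRegression` class.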

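Overview

Softmax regression (synonyms: multinomial logistic regression, maximum entropy classifier, or simply multi-class logistic regression) is a generalization of logistic regression that we can use for multi-class classification, under the assumption that the classes are mutually exclusive. In contrast, we use the (standard) logistic regression model in binary classification tasks.

Below is a schematic of a logistic regression model; for more details, please see the LogisticRegression manual.

In softmax regression (SMR), we replace the sigmoid logistic function with the so-called softmax function $\phi_{softmax}(\cdot)$.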

$P(y=j \mid z^{(i)}) = \phi_{softmax}(z^{(i)}) = \frac{e^{z_{j}^{(i)}}}{\sum_{c=1}^{k} e^{z_{c}^{(i)}}},$

where we define the net input z as

$z = w_1x_1 + ... + w_mx_m + b= \sum_{l=1}^{m} w_l x_l + b= \mathbf{w}^T\mathbf{x} + b.$

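($\mathbf{w}$ is the weight vector, $\mathbf{x}$ is the feature vector of one training sample, and $b$ is the bias unit.)
Now, this softmax function computes the probability that the training sample $\mathbf{x}^{(i)}$ belongs to class $j$, given the weights and the net input $z^{(i)}$. So, we compute the probability $p(y = j \mid \mathbf{x^{(i)}; w}_j)$ for each class label $j = 1, \ldots, k$. Note the normalization term in the denominator, which makes these class probabilities sum to one.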
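To make the link to standard logistic regression concrete, the following minimal sketch checks numerically that a two-class softmax with the second net input fixed at 0 yields the same probability as the sigmoid function. The helper names `sigmoid` and `softmax_vec` and the net input value are illustrative assumptions, not part of mlxtend.

```python
import numpy as np

def sigmoid(z):
    # logistic sigmoid used in binary logistic regression
    return 1.0 / (1.0 + np.exp(-z))

def softmax_vec(z):
    # softmax over a 1D vector of net inputs (one entry per class)
    e = np.exp(z)
    return e / e.sum()

z = 0.8  # arbitrary net input chosen for illustration
p_two_class = softmax_vec(np.array([z, 0.0]))

print('class probabilities sum to:', p_two_class.sum())
print('softmax matches sigmoid:', np.isclose(p_two_class[0], sigmoid(z)))
```

With more than two classes the same normalization applies, only over a longer vector of net inputs.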

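To illustrate the concept of softmax, let us walk through a concrete example. Suppose we have a training set consisting of 4 samples from 3 different classes (0, 1, and 2).

$x_0 \rightarrow \text{class }0$
$x_1 \rightarrow \text{class }1$
$x_2 \rightarrow \text{class }2$
$x_3 \rightarrow \text{class }2$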

import numpy as np

y = np.array([0, 1, 2, 2])

First, we want to encode the class labels into a format that is easier to work with; we apply one-hot encoding:

y_enc = (np.arange(np.max(y) + 1) == y[:, None]).astype(float)

print('one-hot encoding:\n', y_enc)

one-hot encoding:
 [[ 1.  0.  0.]
 [ 0.  1.  0.]
 [ 0.  0.  1.]
 [ 0.  0.  1.]]

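A sample that belongs to class 0 (the first row) has a 1 in the first cell, a sample that belongs to class 1 has a 1 in the second cell of its row, and so forth.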
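As a cross-check, an equivalent way to build this one-hot matrix is to use the class labels as row indices into an identity matrix. The sketch below assumes, as in the example above, that the labels are integers starting at 0; `y_enc_eye` is just an illustrative name.

```python
import numpy as np

y = np.array([0, 1, 2, 2])

# comparison-based encoding, as in the text
y_enc = (np.arange(np.max(y) + 1) == y[:, None]).astype(float)

# identity-matrix encoding: row j of the identity matrix is the one-hot vector for class j
y_enc_eye = np.eye(np.max(y) + 1)[y]

print('same encoding:', np.array_equal(y_enc, y_enc_eye))
```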

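Next, let us define the feature matrix of our 4 training samples. Here, we assume that our dataset consists of 2 features; thus, we create a 4x2 dimensional matrix of samples and features. Similarly, we create a 2x3 dimensional weight matrix (one row per feature and one column per class).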

X = np.array([[0.1, 0.5],
              [1.1, 2.3],
              [-1.1, -2.3],
              [-1.5, -2.5]])

W = np.array([[0.1, 0.2, 0.3],
              [0.1, 0.2, 0.3]])

bias = np.array([0.01, 0.1, 0.1])

print('Inputs X:\n', X)
print('\nWeights W:\n', W)
print('\nbias:\n', bias)

Inputs X:
 [[ 0.1  0.5]
 [ 1.1  2.3]
 [-1.1 -2.3]
 [-1.5 -2.5]]

Weights W:
 [[ 0.1  0.2  0.3]
 [ 0.1  0.2  0.3]]

bias:
 [ 0.01  0.1   0.1 ]

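To compute the net input, we multiply the 4x2 feature matrix X by the 2x3 (n_features x n_classes) weight matrix W, which yields a 4x3 output matrix (n_samples x n_classes), to which we then add the bias unit: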

$\mathbf{Z} = \mathbf{X}\mathbf{W} + \mathbf{b}.$

def net_input(X, W, b):
    return (X.dot(W) + b)

net_in = net_input(X, W, bias)
print('net input:\n', net_in)

net input:
 [[ 0.07  0.22  0.28]
 [ 0.35  0.78  1.12]
 [-0.33 -0.58 -0.92]
 [-0.39 -0.7  -1.1 ]]
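To connect the matrix form with the per-sample definition of the net input given earlier, this short sketch recomputes a single entry of the net-input matrix by hand: the entry for the first sample (row 0) and class 2 (column 2). The variable names `Z` and `z_02` are illustrative.

```python
import numpy as np

X = np.array([[0.1, 0.5],
              [1.1, 2.3],
              [-1.1, -2.3],
              [-1.5, -2.5]])
W = np.array([[0.1, 0.2, 0.3],
              [0.1, 0.2, 0.3]])
bias = np.array([0.01, 0.1, 0.1])

Z = X.dot(W) + bias  # bias is broadcast across all 4 rows

# net input of sample 0 for class 2: w_2^T x + b_2
z_02 = W[:, 2].dot(X[0]) + bias[2]

print('matrix entry:', Z[0, 2])
print('by hand:     ', z_02)
```

Column 2 of `W` plays the role of the class-specific weight vector, and `bias[2]` is the matching bias unit.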

Now, it is time to compute the softmax activation that we discussed earlier:

$P(y=j \mid z^{(i)}) = \phi_{softmax}(z^{(i)}) = \frac{e^{z_{j}^{(i)}}}{\sum_{c=1}^{k} e^{z_{c}^{(i)}}}.$

def softmax(z):
    return (np.exp(z.T) / np.sum(np.exp(z), axis=1)).T

smax = softmax(net_in)
print('softmax:\n', smax)

softmax:
 [[ 0.29450637  0.34216758  0.36332605]
 [ 0.21290077  0.32728332  0.45981591]
 [ 0.42860913  0.33380113  0.23758974]
 [ 0.44941979  0.32962558  0.22095463]]

As we can see, the values for each sample (row) now sum to 1. For example, we can say that the first sample
[ 0.29450637 0.34216758 0.36332605] has a 29.45% probability of belonging to class 0.
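One practical caveat: `np.exp` overflows for large net inputs. A common remedy, sketched below under the name `softmax_stable` (an illustrative helper, not part of the original example and not claimed to be mlxtend's implementation), is to subtract each row's maximum before exponentiating. This leaves the probabilities unchanged because the shift cancels in the ratio.

```python
import numpy as np

def softmax_stable(z):
    # shift each row so its largest entry is 0; the result is mathematically unchanged
    shifted = z - np.max(z, axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=1, keepdims=True)

# net inputs from the example above
net_in = np.array([[0.07, 0.22, 0.28],
                   [0.35, 0.78, 1.12],
                   [-0.33, -0.58, -0.92],
                   [-0.39, -0.7, -1.1]])

smax = softmax_stable(net_in)
print('softmax:\n', smax)
print('row sums:', smax.sum(axis=1))

# inputs this large would overflow a naive np.exp(z) implementation
big = np.array([[1000.0, 1001.0, 1002.0]])
print('large inputs:', softmax_stable(big))
```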

Now, in order to turn these probabilities back into class labels, we can simply take the argmax index position of each row:

[[ 0.29450637 0.34216758 0.36332605] -> 2
[ 0.21290077 0.32728332 0.45981591] -> 2
[ 0.42860913 0.33380113 0.23758974] -> 0
[ 0.44941979 0.32962558 0.22095463]] -> 0

def to_classlabel(z):
    return z.argmax(axis=1)

print('predicted class labels: ', to_classlabel(smax))

predicted class labels:  [2 2 0 0]

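As we can see, our predictions are badly wrong, since the correct class labels are [0, 1, 2, 2]. Now, in order to train our logistic model (e.g., via an optimization algorithm such as gradient descent), we need to define a cost function $J(\cdot)$ that we want to minimize:

$J(\mathbf{W}; \mathbf{b}) = \frac{1}{n} \sum_{i=1}^{n} H(T_i, O_i),$

which is the average of all cross-entropies over our $n$ training samples. The cross-entropy function is defined as

$H(T_i, O_i) = -\sum_m T_i \cdot \log(O_i).$

Here, $T$ stands for "target" (i.e., the true class labels) and $O$ stands for output: the probability computed via softmax, not the predicted class label. The sum over $m$ runs over the class labels.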

def cross_entropy(output, y_target):
    return - np.sum(np.log(output) * (y_target), axis=1)

xent = cross_entropy(smax, y_enc)
print('Cross Entropy:', xent)

Cross Entropy: [ 1.22245465  1.11692907  1.43720989  1.50979788]

def cost(output, y_target):
    return np.mean(cross_entropy(output, y_target))

J_cost = cost(smax, y_enc)
print('Cost: ', J_cost)

Cost:  1.32159787159
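Because each one-hot row contains a single 1, the cross-entropy of a sample reduces to the negative log of the probability assigned to its true class. The sketch below verifies this with integer indexing; `xent_direct` is an illustrative name, and the softmax values are copied from the output printed earlier.

```python
import numpy as np

y = np.array([0, 1, 2, 2])
y_enc = np.eye(3)[y]  # one-hot targets

smax = np.array([[0.29450637, 0.34216758, 0.36332605],
                 [0.21290077, 0.32728332, 0.45981591],
                 [0.42860913, 0.33380113, 0.23758974],
                 [0.44941979, 0.32962558, 0.22095463]])

# definition from the text: sum over classes of T * log(O)
xent = -np.sum(np.log(smax) * y_enc, axis=1)

# shortcut: pick the probability of the true class in each row
xent_direct = -np.log(smax[np.arange(len(y)), y])

print('same values:', np.allclose(xent, xent_direct))
print('cost:', xent_direct.mean())
```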

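In order to learn our softmax model (that is, to determine the weight coefficients) via gradient descent, we need to compute the derivative

$\nabla \mathbf{w}_j \, J(\mathbf{W}; \mathbf{b}).$

I don't want to walk through the tedious details here, but this cost derivative turns out to be simply: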

$\nabla \mathbf{w}_j \, J(\mathbf{W}; \mathbf{b}) = \frac{1}{n} \sum^{n}_{i=1} \big[\mathbf{x}^{(i)}\ \big(O_i - T_i \big) \big]$
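A quick way to gain confidence in this expression without working through the derivation is a finite-difference check. The sketch below stacks the per-class gradients into one matrix, $\frac{1}{n}\mathbf{X}^T(\mathbf{O} - \mathbf{T})$, and compares it with a numerical approximation of the cost gradient on the toy data from above. All helper names are illustrative, and the step size `eps` is an arbitrary choice.

```python
import numpy as np

X = np.array([[0.1, 0.5], [1.1, 2.3], [-1.1, -2.3], [-1.5, -2.5]])
W = np.array([[0.1, 0.2, 0.3], [0.1, 0.2, 0.3]])
bias = np.array([0.01, 0.1, 0.1])
T = np.eye(3)[np.array([0, 1, 2, 2])]  # one-hot targets

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cost(W, b):
    # average cross-entropy over all training samples
    O = softmax(X.dot(W) + b)
    return np.mean(-np.sum(T * np.log(O), axis=1))

# analytic gradient: column j holds the gradient for the weight vector w_j
O = softmax(X.dot(W) + bias)
grad_analytic = X.T.dot(O - T) / X.shape[0]

# numerical gradient via central differences, one weight at a time
eps = 1e-6
grad_numeric = np.zeros_like(W)
for l in range(W.shape[0]):
    for j in range(W.shape[1]):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[l, j] += eps
        W_minus[l, j] -= eps
        grad_numeric[l, j] = (cost(W_plus, bias) - cost(W_minus, bias)) / (2 * eps)

print('gradients agree:', np.allclose(grad_analytic, grad_numeric, atol=1e-6))
```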

We can then use the cost derivative to update the weights in the opposite direction of the cost gradient with learning rate $\eta$:

$\mathbf{w}_j := \mathbf{w}_j - \eta \nabla \mathbf{w}_j \, J(\mathbf{W}; \mathbf{b})$

for each class $j \in \{1, ..., k\}$

(note that $\mathbf{w}_j$ is the weight vector for the class $y=j$), and we update the bias units

$\mathbf{b}_j := \mathbf{b}_j - \eta \bigg[ \frac{1}{n} \sum^{n}_{i=1} \big(O_i - T_i \big) \bigg].$
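Put together, one full-batch gradient descent step on the toy data might look as follows. This is a sketch in plain NumPy, not mlxtend's implementation; the learning rate `eta = 0.1` is an arbitrary choice for illustration.

```python
import numpy as np

X = np.array([[0.1, 0.5], [1.1, 2.3], [-1.1, -2.3], [-1.5, -2.5]])
W = np.array([[0.1, 0.2, 0.3], [0.1, 0.2, 0.3]])
bias = np.array([0.01, 0.1, 0.1])
T = np.eye(3)[np.array([0, 1, 2, 2])]  # one-hot targets

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cost(O, T):
    # average cross-entropy over all training samples
    return np.mean(-np.sum(T * np.log(O), axis=1))

eta = 0.1  # illustrative learning rate
O = softmax(X.dot(W) + bias)
cost_before = cost(O, T)

diff = O - T
grad_W = X.T.dot(diff) / X.shape[0]  # one column per class
grad_b = diff.mean(axis=0)           # one entry per class

# move against the gradient
W_new = W - eta * grad_W
bias_new = bias - eta * grad_b

cost_after = cost(softmax(X.dot(W_new) + bias_new), T)
print('cost before: %.4f' % cost_before)
print('cost after:  %.4f' % cost_after)
```

Repeating this step for a number of epochs is the essence of gradient descent training; the cost after the step should be lower than before for a sufficiently small learning rate.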

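As a penalty against complexity, that is, a way to reduce the variance of our model and the degree of overfitting by adding extra bias, we can further add a regularization term such as the L2 term with the regularization parameter $\lambda$: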

L2: $\frac{\lambda}{2} ||\mathbf{w}||_{2}^{2}$,

where

$||\mathbf{w}||_{2}^{2} = \sum^{m}_{l=1} \sum^{k}_{j=1} w_{l, j}^{2},$

so that our cost function becomes

$J(\mathbf{W}; \mathbf{b}) = \frac{1}{n} \sum_{i=1}^{n} H(T_i, O_i) + \frac{\lambda}{2} ||\mathbf{w}||_{2}^{2}$

and we define the "regularized" weight update as

$\mathbf{w}_j := \mathbf{w}_j - \eta \big[\nabla \mathbf{w}_j \, J(\mathbf{W}) + \lambda \mathbf{w}_j \big].$

(Please note that we don't regularize the bias term.)
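In code, the L2 penalty and the regularized update could be sketched like this. The values `lam = 0.1` and `eta = 0.1` are arbitrary illustrative choices, and the bias is updated without a penalty, as noted above. This is plain NumPy and not mlxtend's implementation.

```python
import numpy as np

X = np.array([[0.1, 0.5], [1.1, 2.3], [-1.1, -2.3], [-1.5, -2.5]])
W = np.array([[0.1, 0.2, 0.3], [0.1, 0.2, 0.3]])
bias = np.array([0.01, 0.1, 0.1])
T = np.eye(3)[np.array([0, 1, 2, 2])]  # one-hot targets

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

lam, eta = 0.1, 0.1  # illustrative regularization strength and learning rate

O = softmax(X.dot(W) + bias)
xent_cost = np.mean(-np.sum(T * np.log(O), axis=1))
l2_penalty = lam / 2.0 * np.sum(W ** 2)  # squared L2 norm of all weights
print('regularized cost: %.4f' % (xent_cost + l2_penalty))

diff = O - T
grad_W = X.T.dot(diff) / X.shape[0]
W_new = W - eta * (grad_W + lam * W)       # the penalty pulls weights toward zero
bias_new = bias - eta * diff.mean(axis=0)  # no penalty on the bias
```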

Example 1 - Gradient Descent

from mlxtend.data import iris_data
from mlxtend.plotting import plot_decision_regions
from mlxtend.classifier import SoftmaxRegression
import matplotlib.pyplot as plt

# Loading Data

X, y = iris_data()
X = X[:, [0, 3]] # sepal length and petal width

# standardize
X[:,0] = (X[:,0] - X[:,0].mean()) / X[:,0].std()
X[:,1] = (X[:,1] - X[:,1].mean()) / X[:,1].std()

lr = SoftmaxRegression(eta=0.01, 
                       epochs=500, 
                       minibatches=1, 
                       random_seed=1,
                       print_progress=3)
lr.fit(X, y)

plot_decision_regions(X, y, clf=lr)
plt.title('Softmax Regression - Gradient Descent')
plt.show()

plt.plot(range(len(lr.cost_)), lr.cost_)
plt.xlabel('Iterations')
plt.ylabel('Cost')
plt.show()

Iteration: 500/500 | Cost 0.06 | Elapsed: 0:00:00 | ETA: 0:00:00

Predicting Class Labels

y_pred = lr.predict(X)
print('Last 3 Class Labels: %s' % y_pred[-3:])

Last 3 Class Labels: [2 2 2]

Predicting Class Probabilities

y_proba = lr.predict_proba(X)
print('Last 3 Class Probabilities:\n %s' % y_proba[-3:])

Last 3 Class Probabilities:
 [[  9.18728149e-09   1.68894679e-02   9.83110523e-01]
 [  2.97052325e-11   7.26356627e-04   9.99273643e-01]
 [  1.57464093e-06   1.57779528e-01   8.42218897e-01]]
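The two outputs are related: the predicted label is the index of the largest class probability. The NumPy-only sketch below checks this on the three probability rows printed above (copied by hand; `proba_last3` is an illustrative name).

```python
import numpy as np

# last three rows of the predict_proba output shown above
proba_last3 = np.array([[9.18728149e-09, 1.68894679e-02, 9.83110523e-01],
                        [2.97052325e-11, 7.26356627e-04, 9.99273643e-01],
                        [1.57464093e-06, 1.57779528e-01, 8.42218897e-01]])

# the row-wise argmax recovers the class labels
labels = proba_last3.argmax(axis=1)
print('labels from probabilities:', labels)
print('row sums:', proba_last3.sum(axis=1))
```

All three samples are assigned to class 2, which matches the labels returned by `predict`.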

Example 2 - Stochastic Gradient Descent

from mlxtend.data import iris_data
from mlxtend.plotting import plot_decision_regions
from mlxtend.classifier import SoftmaxRegression
import matplotlib.pyplot as plt

# Loading Data

X, y = iris_data()
X = X[:, [0, 3]] # sepal length and petal width

# standardize
X[:,0] = (X[:,0] - X[:,0].mean()) / X[:,0].std()
X[:,1] = (X[:,1] - X[:,1].mean()) / X[:,1].std()

lr = SoftmaxRegression(eta=0.01, epochs=300, minibatches=len(y), random_seed=1)
lr.fit(X, y)

plot_decision_regions(X, y, clf=lr)
plt.title('Softmax Regression - Stochastic Gradient Descent')
plt.show()

plt.plot(range(len(lr.cost_)), lr.cost_)
plt.xlabel('Iterations')
plt.ylabel('Cost')
plt.show()

API

SoftmaxRegression(eta=0.01, epochs=50, l2=0.0, minibatches=1, n_classes=None, random_seed=None, print_progress=0)

Softmax regression classifier.

Parameters

eta : float (default: 0.01)

Learning rate (between 0.0 and 1.0)
epochs : int (default: 50)

Passes over the training dataset. Prior to each epoch, the dataset is shuffled if minibatches > 1 to prevent cycles in stochastic gradient descent.
l2 : float (default: 0.0)

Regularization parameter for L2 regularization. No regularization if l2=0.0.
minibatches : int (default: 1)

The number of minibatches for gradient-based optimization. If 1: Gradient Descent learning. If len(y): Stochastic Gradient Descent (SGD) online learning. If 1 < minibatches < len(y): SGD Minibatch learning.
n_classes : int (default: None)

A positive integer to declare the number of class labels if not all class labels are present in a partial training set. Gets the number of class labels automatically if None.
random_seed : int (default: None)

Set random state for shuffling and initializing the weights.
print_progress : int (default: 0)

Prints progress in fitting to stderr. 0: No output; 1: Epochs elapsed and cost; 2: 1 plus time elapsed; 3: 2 plus estimated time until completion.
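To picture the three `minibatches` regimes described above, here is a NumPy sketch of how sample indices could be shuffled and split in each epoch. It is an assumption for illustration and not taken from mlxtend's source; the helper `minibatch_indices` is hypothetical.

```python
import numpy as np

def minibatch_indices(n_samples, minibatches, rng):
    # hypothetical helper: shuffle only when more than one minibatch is used
    idx = np.arange(n_samples)
    if minibatches > 1:
        idx = rng.permutation(idx)
    # split the (possibly shuffled) indices into roughly equal parts
    return np.array_split(idx, minibatches)

rng = np.random.RandomState(1)
n_samples = 6

for mb in (1, 3, n_samples):
    batches = minibatch_indices(n_samples, mb, rng)
    print('minibatches=%d -> batch sizes %s' % (mb, [len(b) for b in batches]))
```

With `minibatches=1` there is a single batch holding every sample (gradient descent), with `minibatches=n_samples` every batch holds one sample (online SGD), and values in between give minibatch SGD.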

Attributes

w_ : 2d-array, shape={n_features, n_classes}

Model weights after fitting.
b_ : 1d-array, shape={n_classes,}

Bias unit after fitting.
cost_ : list

List of floats, the average cross_entropy for each epoch.

Examples

For usage examples, please see https://rasbt.github.io/mlxtend/user_guide/classifier/SoftmaxRegression/

Methods

fit(X, y, init_params=True)

Learn model from training data.

Parameters

X : {array-like, sparse matrix}, shape = [n_samples, n_features]

Training vectors, where n_samples is the number of samples and n_features is the number of features.
y : array-like, shape = [n_samples]

Target values.
init_params : bool (default: True)

Re-initializes model parameters prior to fitting. Set False to continue training with weights from a previous model fitting.

Returns

self : object

predict(X)

Predict targets from X.

Parameters

X : {array-like, sparse matrix}, shape = [n_samples, n_features]

Training vectors, where n_samples is the number of samples and n_features is the number of features.

Returns

target_values : array-like, shape = [n_samples]

Predicted target values.

predict_proba(X)

Predict class probabilities of X from the net input.

Parameters

X : {array-like, sparse matrix}, shape = [n_samples, n_features]

Training vectors, where n_samples is the number of samples and n_features is the number of features.

Returns

Class probabilities : array-like, shape= [n_samples, n_classes]

score(X, y)

Compute the prediction accuracy.

Parameters

X : {array-like, sparse matrix}, shape = [n_samples, n_features]

Training vectors, where n_samples is the number of samples and n_features is the number of features.
y : array-like, shape = [n_samples]

Target values (true class labels).

Returns

acc : float

The prediction accuracy as a float between 0.0 and 1.0 (perfect score).
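The accuracy described here is the fraction of correctly predicted labels. As a plain-NumPy illustration of that definition (not mlxtend's code), applying it to the untrained toy predictions from the worked example earlier gives 0.0, since none of the four predictions matched the true labels.

```python
import numpy as np

y_true = np.array([0, 1, 2, 2])
y_pred = np.array([2, 2, 0, 0])  # untrained toy predictions from the worked example

# fraction of positions where prediction and target agree
acc = np.mean(y_true == y_pred)
print('accuracy:', acc)
```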
