OneRClassifier: One Rule (OneR) method for classification
An implementation of the One Rule (OneR) method for classification.
# Import the classifier
from mlxtend.classifier import OneRClassifier
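Overview
"OneR" stands for One Rule (by Robert Holte [1]), a classic algorithm for supervised learning. Note that this algorithm is not known for good predictive performance; it is therefore better suited to teaching and to serving as a lower-bound performance baseline in real-world applications.
The name "OneRule" can be a bit misleading, because the method is really about "one feature" rather than "one rule." That is, OneR returns a single feature for which one or more decision rules are defined. In essence, as a simple classifier, it finds exactly one feature (and one or more values of that feature) to classify data instances.
The basic procedure is as follows: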
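- For each feature among all features (columns) in the dataset:
  - For each feature value of the given feature:
    - Obtain the training examples with that feature value.
    - Obtain the class labels (and class label counts) corresponding to the training examples identified in the previous step.
    - Regard the class label with the highest frequency (count) as the majority class.
    - Record the number of errors as the number of training examples that have the given feature value but do not belong to the majority class.
  - Compute the error of the feature by summing the errors over all possible values of that feature.
- Return the best feature, defined as the feature with the lowest error.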
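The procedure above can be sketched in a few lines of plain Python. This is only an illustration of the algorithm as described, not the MLxtend implementation; the function name `one_r` and the toy data are hypothetical, and ties between features are assumed to go to the lower column index.

```python
from collections import Counter, defaultdict


def one_r(X, y):
    """Return (best_feature_index, rules, total_error) for categorical data.

    X is a list of rows (lists of discrete feature values), y a list of labels.
    """
    best = None
    for col in range(len(X[0])):
        # Count the class labels separately for each value of this feature.
        counts = defaultdict(Counter)
        for row, label in zip(X, y):
            counts[row[col]][label] += 1

        rules, feature_error = {}, 0
        for value, label_counts in counts.items():
            majority_label, majority_count = label_counts.most_common(1)[0]
            rules[value] = majority_label
            # Errors: examples with this value that are not the majority class.
            feature_error += sum(label_counts.values()) - majority_count

        # Strict "<" keeps the earlier feature when two errors are equal.
        if best is None or feature_error < best[2]:
            best = (col, rules, feature_error)
    return best


# Hypothetical toy data: column 1 separates the two classes perfectly.
X_toy = [[0, 0], [0, 0], [1, 1], [1, 1], [0, 1], [1, 0]]
y_toy = [0, 0, 1, 1, 1, 0]

feature_idx, rules, total_error = one_r(X_toy, y_toy)
print(feature_idx, rules, total_error)  # 1 {0: 0, 1: 1} 0
```

On the toy data, column 0 misclassifies two examples while column 1 misclassifies none, so column 1 is returned together with its value-to-class rules.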
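Note that the OneR algorithm assumes categorical (or discretized) feature values. For a good explanation of the OneR classifier, see the online chapter "4.5.1 Learn Rules from a Single Feature (OneR)" of Interpretable Machine Learning (https://christophm.github.io/interpretable-ml-book/rules.html, [2]).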
References
[1] Holte, Robert C. "Very simple classification rules perform well on most commonly used datasets." Machine Learning 11.1 (1993): 63-90.
[2] Molnar, Christoph. Interpretable Machine Learning (2018): https://christophm.github.io/interpretable-ml-book/rules.html
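Example 1 -- Demonstrating OneR on a discretized Iris dataset
As mentioned above, the OneR algorithm expects categorical or discretized features. The OneRClassifier implementation in MLxtend does not modify the features in the dataset; ensuring that the features are categorical is the user's responsibility.
In the following example, we discretize the Iris dataset. Specifically, we bin each feature by its quartiles, so that every feature value is replaced by a categorical value. For sepal length (the first column in Iris), this gives: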
- (0, 5.1] => 0
- (5.1, 5.8] => 1
- (5.8, 6.4] => 2
- (6.4, 7.9] => 3
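As a small illustration of this mapping, the sketch below assigns sepal-length values to the four bins using only the standard library. The helper name `to_quartile_bin` and the sample values are hypothetical; the bin edges are the ones listed above.

```python
from bisect import bisect_left

# Upper edges of the first three bins, taken from the list above.
# The last bin (6.4, 7.9] needs no edge: anything above 6.4 lands in it.
thresholds = [5.1, 5.8, 6.4]


def to_quartile_bin(value):
    # bisect_left counts the edges strictly below `value`, so a value equal
    # to an edge stays in the lower (right-closed) bin.
    return bisect_left(thresholds, value)


sepal_lengths = [5.1, 4.9, 5.4, 5.8, 6.4, 7.0]
bins = [to_quartile_bin(v) for v in sepal_lengths]
print(bins)  # [0, 0, 1, 1, 2, 3]
```

Values that fall exactly on an edge, such as 5.1 or 5.8, are assigned to the lower bin because the intervals are closed on the right.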
Below are the first 15 rows (flowers) of the original Iris data:
from mlxtend.data import iris_data
X, y = iris_data()
X[:15]
array([[5.1, 3.5, 1.4, 0.2],
[4.9, 3. , 1.4, 0.2],
[4.7, 3.2, 1.3, 0.2],
[4.6, 3.1, 1.5, 0.2],
[5. , 3.6, 1.4, 0.2],
[5.4, 3.9, 1.7, 0.4],
[4.6, 3.4, 1.4, 0.3],
[5. , 3.4, 1.5, 0.2],
[4.4, 2.9, 1.4, 0.2],
[4.9, 3.1, 1.5, 0.1],
[5.4, 3.7, 1.5, 0.2],
[4.8, 3.4, 1.6, 0.2],
[4.8, 3. , 1.4, 0.1],
[4.3, 3. , 1.1, 0.1],
[5.8, 4. , 1.2, 0.2]])
Below is the discretized dataset. Each feature is divided into four quartile bins.
import numpy as np
def get_feature_quartiles(X):
    X_discretized = X.copy()
    for col in range(X.shape[1]):
        # Start at the highest quantile so that lower bins overwrite higher ones.
        for q, class_label in zip([1.0, 0.75, 0.5, 0.25], [3, 2, 1, 0]):
            threshold = np.quantile(X[:, col], q=q)
            X_discretized[X[:, col] <= threshold, col] = class_label
    # Cast the bin labels to integers.
    return X_discretized.astype(int)
Xd = get_feature_quartiles(X)
Xd[:15]
array([[0, 3, 0, 0],
[0, 1, 0, 0],
[0, 2, 0, 0],
[0, 2, 0, 0],
[0, 3, 0, 0],
[1, 3, 1, 1],
[0, 3, 0, 0],
[0, 3, 0, 0],
[0, 1, 0, 0],
[0, 2, 0, 0],
[1, 3, 0, 0],
[0, 3, 0, 0],
[0, 1, 0, 0],
[0, 1, 0, 0],
[1, 3, 0, 0]])
Given a dataset with categorical features, we can use the OneR classifier much like a scikit-learn estimator. First, let's split the dataset into training and test data:
from sklearn.model_selection import train_test_split
Xd_train, Xd_test, y_train, y_test = train_test_split(Xd, y, random_state=0, stratify=y)
Next, we can train a OneRClassifier model on the training set using the fit method:
from mlxtend.classifier import OneRClassifier
oner = OneRClassifier()
oner.fit(Xd_train, y_train);
After model fitting, the column index of the selected feature is accessible via the feature_idx_ attribute:
oner.feature_idx_
2
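The prediction_dict_ attribute is also available after model fitting. It lists the total error for the selected feature (i.e., the feature listed under feature_idx_). In addition, it provides the classification rules: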
oner.prediction_dict_
{'total error': 16, 'rules (value: class)': {0: 0, 1: 1, 2: 1, 3: 2}}
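That is, 'rules (value: class)': {0: 0, 1: 1, 2: 1, 3: 2} means that there are four rules for the selected feature (petal length):
- if the value is 0, classify as 0 (Iris-setosa)
- if the value is 1, classify as 1 (Iris-versicolor)
- if the value is 2, classify as 1 (Iris-versicolor)
- if the value is 3, classify as 2 (Iris-virginica)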
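To make the rules concrete, the sketch below applies them as a plain dictionary lookup on the selected column. It is an illustration only and is not claimed to mirror how MLxtend implements predict; the first two rows come from the discretized data shown earlier, while the last two rows are hypothetical.

```python
feature_idx = 2                      # selected feature (petal length)
rules = {0: 0, 1: 1, 2: 1, 3: 2}     # 'rules (value: class)' from above
class_names = {0: "Iris-setosa", 1: "Iris-versicolor", 2: "Iris-virginica"}

rows = [
    [0, 3, 0, 0],  # first row of the discretized data
    [1, 3, 1, 1],  # sixth row of the discretized data
    [2, 1, 2, 2],  # hypothetical row
    [3, 2, 3, 3],  # hypothetical row
]

# Only the value in the selected column matters for the prediction.
predictions = [rules[row[feature_idx]] for row in rows]
print(predictions)                            # [0, 1, 1, 2]
print([class_names[p] for p in predictions])
```

All other columns are ignored, which is why OneR is better described as a "one feature" method.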
After model fitting, we can use the oner object to make predictions:
oner.predict(Xd_train)
array([1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 2, 2, 1,
0, 1, 2, 1, 2, 1, 1, 2, 1, 2, 1, 0, 0, 1, 2, 1, 1, 2, 2, 1, 0, 1,
1, 1, 2, 0, 1, 2, 1, 2, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 2, 0, 1, 1,
0, 1, 2, 1, 2, 0, 1, 2, 1, 1, 2, 0, 1, 0, 0, 1, 1, 2, 0, 0, 0, 1,
0, 1, 2, 2, 2, 0, 1, 0, 2, 0, 1, 1, 1, 1, 0, 2, 2, 0, 1, 1, 0, 2,
1, 2])
y_pred = oner.predict(Xd_train)
train_acc = np.mean(y_pred == y_train)
print(f'Training accuracy {train_acc*100:.2f}%')
Training accuracy 85.71%
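The training accuracy is consistent with the total error reported in prediction_dict_. As a quick check, the sketch below recomputes it from the total error of 16 and the 112 training examples (the number of predictions in the output above); the variable names are only illustrative.

```python
total_error = 16   # 'total error' from prediction_dict_
n_train = 112      # number of training examples (predictions shown above)

# Every training example that is not counted as an error is classified correctly.
train_acc_from_error = (n_train - total_error) / n_train
print(f'Training accuracy {train_acc_from_error*100:.2f}%')  # Training accuracy 85.71%
```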
y_pred = oner.predict(Xd_test)
test_acc = np.mean(y_pred == y_test)
print(f'Test accuracy {test_acc*100:.2f}%')
Test accuracy 84.21%
Instead of computing the prediction accuracy manually as shown above, we can also use the score method:
test_acc = oner.score(Xd_test, y_test)
print(f'Test accuracy {test_acc*100:.2f}%')
Test accuracy 84.21%
API
OneRClassifier(resolve_ties='first')
OneR (One Rule) Classifier.
Parameters
- resolve_ties : str (default: 'first')
  Option for how to resolve ties if two or more features have the same error. Options are
  - 'first' (default): chooses first feature in the list, i.e., feature with the lower column index.
  - 'chi-squared': performs a chi-squared test for each feature against the target and selects the feature with the lowest p-value.
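A minimal sketch of what the 'first' option amounts to, assuming the per-feature errors have already been computed. The error values are hypothetical and the code is not taken from MLxtend.

```python
# Hypothetical total errors for four features; columns 0 and 2 are tied.
feature_errors = [16, 31, 16, 20]

# min() returns the first minimal element, so the tie between columns 0 and 2
# is resolved in favor of the lower column index.
best_feature = min(range(len(feature_errors)), key=feature_errors.__getitem__)
print(best_feature)  # 0
```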
Attributes
- self.classes_labels_ : array-like, shape = [n_labels]
  Array containing the unique class labels found in the training set.
- self.feature_idx_ : int
  The index of the rules' feature based on the column in the training set.
- self.p_value_ : float
  The p value for a given feature. Only available after calling fit when the OneR attribute resolve_ties = 'chi-squared' is set.
- self.prediction_dict_ : dict
  Dictionary containing information about the feature's (self.feature_idx_) rules and total error. E.g., {'total error': 37, 'rules (value: class)': {0: 0, 1: 2}} means the total error is 37, and the rules are "if feature value == 0 classify as 0" and "if feature value == 1 classify as 2". (And classify as class 1 otherwise.)
For usage examples, please see https://rasbt.github.io/mlxtend/user_guide/classifier/OneRClassifier/
Methods
fit(X, y)
Learn rule from training data.
Parameters
- X : {array-like, sparse matrix}, shape = [n_samples, n_features]
  Training vectors, where n_samples is the number of samples and n_features is the number of features.
- y : array-like, shape = [n_samples]
  Target values.
Returns
- self : object
get_params(deep=True)
Get parameters for this estimator.
Parameters
- deep : bool, default=True
  If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns
- params : mapping of string to any
  Parameter names mapped to their values.
predict(X)
Predict class labels for X.
Parameters
- X : {array-like, sparse matrix}, shape = [n_samples, n_features]
  Input vectors to classify, where n_samples is the number of samples and n_features is the number of features.
Returns
- maj : array-like, shape = [n_samples]
  Predicted class labels.
score(X, y, sample_weight=None)
Return the mean accuracy on the given test data and labels.
In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.
Parameters
- X : array-like of shape (n_samples, n_features)
  Test samples.
- y : array-like of shape (n_samples,) or (n_samples, n_outputs)
  True labels for X.
- sample_weight : array-like of shape (n_samples,), default=None
  Sample weights.
Returns
- score : float
  Mean accuracy of self.predict(X) wrt. y.
set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
Parameters
- **params : dict
  Estimator parameters.
Returns
- self : object
  Estimator instance.