Utility Functions

pyod.utils.data module

Utility functions for manipulating data.

pyod.utils.data.check_consistent_shape(X_train, y_train, X_test, y_test, y_train_pred, y_test_pred)

Internal utility that checks whether the shapes of the input data are consistent.

Parameters:

X_train (numpy array of shape (n_samples, n_features)) – The training samples.
y_train (list or array of shape (n_samples,)) – The ground truth of the training samples.
X_test (numpy array of shape (n_samples, n_features)) – The test samples.
y_test (list or array of shape (n_samples,)) – The ground truth of the test samples.
y_train_pred (numpy array of shape (n_samples,)) – The predicted binary labels of the training samples.
y_test_pred (numpy array of shape (n_samples,)) – The predicted binary labels of the test samples.

Returns:

X_train (numpy array of shape (n_samples, n_features)) – The training samples.
y_train (list or array of shape (n_samples,)) – The ground truth of the training samples.
X_test (numpy array of shape (n_samples, n_features)) – The test samples.
y_test (list or array of shape (n_samples,)) – The ground truth of the test samples.
y_train_pred (numpy array of shape (n_samples,)) – The predicted binary labels of the training samples.
y_test_pred (numpy array of shape (n_samples,)) – The predicted binary labels of the test samples.

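pyod.utils.data.evaluate_print(clf_name, y, y_pred)

Utility function for evaluating and printing the results of the examples. The default metrics are ROC and Precision @ n.

Parameters:

clf_name (str) – The name of the detector.
y (list or numpy array of shape (n_samples,)) – The ground truth. Binary (0: inliers, 1: outliers).
y_pred (list or numpy array of shape (n_samples,)) – The raw outlier scores as returned by a fitted model.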

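pyod.utils.data.generate_data(n_train=1000, n_test=500, n_features=2, contamination=0.1, train_only=False, offset=10, behaviour='new', random_state=None, n_nan=0, n_inf=0)

Utility function to generate synthesized data. Normal data is generated by a multivariate Gaussian distribution and outliers are generated by a uniform distribution. With the default behaviour='new', it returns "X_train, X_test, y_train, y_test".

Parameters:

n_train (int, (default=1000)) – The number of training points to generate.
n_test (int, (default=500)) – The number of test points to generate.
n_features (int, optional (default=2)) – The number of features (dimensions).
contamination (float in (0., 0.5), optional (default=0.1)) – The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.
train_only (bool, optional (default=False)) – If True, generate training data only.
offset (int, optional (default=10)) – Adjusts the value range of the Gaussian and uniform distributions.
behaviour (str, default='new') – Controls the order of the returned datasets and can be either 'old' or 'new'. Passing behaviour='new' returns "X_train, X_test, y_train, y_test", while passing behaviour='old' returns "X_train, y_train, X_test, y_test".
random_state (int, RandomState instance or None, optional (default=None)) – If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.
n_nan (int) – The number of missing values (np.nan). Defaults to zero.
n_inf (int) – The number of infinite values (np.inf). Defaults to zero.

Returns:

X_train (numpy array of shape (n_train, n_features)) – Training data.
X_test (numpy array of shape (n_test, n_features)) – Test data.
y_train (numpy array of shape (n_train,)) – Training ground truth.
y_test (numpy array of shape (n_test,)) – Test ground truth.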
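
The generation scheme described above (Gaussian inliers, uniform outliers) can be sketched with NumPy alone. This is a minimal illustration and not PyOD's implementation: the helper name sketch_generate_data, the unit-variance Gaussian centered at offset, and the uniform range [-offset, offset) are all assumptions made for the example.

```python
import numpy as np

def sketch_generate_data(n_samples=100, n_features=2, contamination=0.1,
                         offset=10, random_state=None):
    """Hypothetical helper: Gaussian inliers plus uniform outliers."""
    rng = np.random.RandomState(random_state)
    n_outliers = int(n_samples * contamination)
    n_inliers = n_samples - n_outliers
    # Inliers: isotropic unit-variance Gaussian shifted by `offset` (assumed layout).
    X_in = rng.randn(n_inliers, n_features) + offset
    # Outliers: uniform over [-offset, offset) in every dimension (assumed range).
    X_out = rng.uniform(-offset, offset, size=(n_outliers, n_features))
    X = np.vstack([X_in, X_out])
    # Labels follow the convention used here: 0 = inlier, 1 = outlier.
    y = np.concatenate([np.zeros(n_inliers), np.ones(n_outliers)])
    return X, y

X, y = sketch_generate_data(n_samples=200, contamination=0.1, random_state=42)
print(X.shape, y.shape, int(y.sum()))
```

Passing an integer random_state makes the draw reproducible, which mirrors the role of the random_state parameter documented above.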

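pyod.utils.data.generate_data_categorical(n_train=1000, n_test=500, n_features=2, n_informative=2, n_category_in=2, n_category_out=2, contamination=0.1, shuffle=True, random_state=None)

Utility function to generate synthesized categorical data.

Parameters:

n_train (int, (default=1000)) – The number of training points to generate.
n_test (int, (default=500)) – The number of test points to generate.
n_features (int, optional (default=2)) – The number of features for each sample.
n_informative (int in (1, n_features), optional (default=2)) – The number of informative features in the outlier points. The higher the value, the easier outlier detection should be. Note that n_informative should not exceed n_features.
n_category_in (int in (1, n_inliers), optional (default=2)) – The number of categories in the inlier points.
n_category_out (int in (1, n_outliers), optional (default=2)) – The number of categories in the outlier points.
contamination (float in (0., 0.5), optional (default=0.1)) – The amount of contamination of the data set, i.e. the proportion of outliers in the data set.
shuffle (bool, optional (default=True)) – If True, the inliers are shuffled, which produces a noisier distribution.
random_state (int, RandomState instance or None, optional (default=None)) – If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.

Returns:

X_train (numpy array of shape (n_train, n_features)) – Training data.
y_train (numpy array of shape (n_train,)) – Training ground truth.
X_test (numpy array of shape (n_test, n_features)) – Test data.
y_test (numpy array of shape (n_test,)) – Test ground truth.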

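pyod.utils.data.generate_data_clusters(n_train=1000, n_test=500, n_clusters=2, n_features=2, contamination=0.1, size='same', density='same', dist=0.25, random_state=None, return_in_clusters=False)

Utility function to generate synthesized data in clusters. The generated data can involve low-density pattern problems and global outliers, both of which are considered difficult tasks for outlier detection algorithms.

Parameters:

n_train (int, (default=1000)) – The number of training points to generate.
n_test (int, (default=500)) – The number of test points to generate.
n_clusters (int, optional (default=2)) – The number of centers (i.e. clusters) to generate.
n_features (int, optional (default=2)) – The number of features for each sample.
contamination (float in (0., 0.5), optional (default=0.1)) – The amount of contamination of the data set, i.e. the proportion of outliers in the data set.
size (str, optional (default='same')) – Size of each cluster: 'same' generates clusters of the same size, 'different' generates clusters of different sizes.
density (str, optional (default='same')) – Density of each cluster: 'same' generates clusters of the same density, 'different' generates clusters of different densities.
dist (float, optional (default=0.25)) – Distance between clusters. Should be between 0.0 and 1.0. It is used to avoid overlapping clusters as much as possible. However, if the number of samples and the number of clusters are too high, it is unlikely that the clusters can be separated completely, even if dist is set to 1.0.
random_state (int, RandomState instance or None, optional (default=None)) – If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.
return_in_clusters (bool, optional (default=False)) – If True, the function returns X_train, y_train, X_test, y_test each as a list of numpy arrays, where each index represents a cluster. If False, it returns X_train, y_train, X_test, y_test each as a single numpy array obtained by concatenating the per-cluster arrays.

Returns:

X_train (numpy array of shape (n_train, n_features)) – Training data.
y_train (numpy array of shape (n_train,)) – Training ground truth.
X_test (numpy array of shape (n_test, n_features)) – Test data.
y_test (numpy array of shape (n_test,)) – Test ground truth.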

pyod.utils.data.get_outliers_inliers(X, y)

Internal method to separate inliers from outliers.

Parameters:

X (numpy array of shape (n_samples, n_features)) – The input samples.
y (list or array of shape (n_samples,)) – The ground truth of the input samples.

Returns:

X_outliers (numpy array of shape (n_outliers, n_features)) – The outliers.
X_inliers (numpy array of shape (n_inliers, n_features)) – The inliers.
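
The split described above amounts to boolean masking on the label vector. The following is a small NumPy sketch of that idea, assuming the 0 = inlier, 1 = outlier convention used throughout this page; it is not the library's code.

```python
import numpy as np

X = np.array([[0.0, 0.1], [5.0, 5.2], [0.2, -0.1], [6.1, 5.9]])
y = np.array([0, 1, 0, 1])  # 1 marks an outlier, 0 an inlier

# Boolean masks select the rows that belong to each group.
X_outliers = X[y == 1]
X_inliers = X[y == 0]
print(X_outliers.shape, X_inliers.shape)
```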

pyod.utils.example module

Utility functions for running examples.

pyod.utils.example.data_visualize(X_train, y_train, show_figure=True, save_figure=False)

Utility function for visualizing the synthetic samples generated by the generate_data_clusters function.

Parameters:

X_train (numpy array of shape (n_samples, n_features)) – The training samples.
y_train (list or array of shape (n_samples,)) – The ground truth of the training samples.
show_figure (bool, optional (default=True)) – If set to True, show the figure.
save_figure (bool, optional (default=False)) – If set to True, save the figure locally.

pyod.utils.example.visualize(clf_name, X_train, y_train, X_test, y_test, y_train_pred, y_test_pred, show_figure=True, save_figure=False)

Utility function for visualizing the results in the examples. Internal use only.

Parameters:

clf_name (str) – The name of the detector.
X_train (numpy array of shape (n_samples, n_features)) – The training samples.
y_train (list or array of shape (n_samples,)) – The ground truth of the training samples.
X_test (numpy array of shape (n_samples, n_features)) – The test samples.
y_test (list or array of shape (n_samples,)) – The ground truth of the test samples.
y_train_pred (numpy array of shape (n_samples,)) – The predicted binary labels of the training samples.
y_test_pred (numpy array of shape (n_samples,)) – The predicted binary labels of the test samples.
show_figure (bool, optional (default=True)) – If set to True, show the figure.
save_figure (bool, optional (default=False)) – If set to True, save the figure locally.

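pyod.utils.stat_models module

A collection of statistical models.

pyod.utils.stat_models.column_ecdf(matrix: ndarray) → ndarray

Utility function to compute the column-wise empirical cumulative distribution of a 2D feature matrix, where the rows are samples and the columns are the features of each sample. The accumulation is done in the positive direction of the sample axis.

For example, given p(1) = 0.2, p(0) = 0.3, p(2) = 0.1 and p(6) = 0.4, the ECDF is E(x) = p(X <= x), e.g. E(5) = p(X <= 5). This gives E(-1) = 0, E(0) = 0.3, E(1) = 0.5, E(2) = 0.6, E(3) = 0.6, E(4) = 0.6, E(5) = 0.6, E(6) = 1.

Similar to and tested against: https://www.statsmodels.org/stable/generated/statsmodels.distributions.empirical_distribution.ECDF.html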
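
The worked ECDF values above can be reproduced with a few lines of NumPy. This is a brute-force sketch for illustration, not the library's implementation; the helper name sketch_column_ecdf is hypothetical, and it assumes that tied values share the same ECDF value.

```python
import numpy as np

def sketch_column_ecdf(matrix):
    """Hypothetical helper: per-column ECDF value of every entry."""
    matrix = np.asarray(matrix, dtype=float)
    n_samples = matrix.shape[0]
    # For each entry, count the samples in the same column that are <= it.
    # Broadcasting builds an (n_samples, n_samples, n_features) comparison.
    counts = (matrix[None, :, :] <= matrix[:, None, :]).sum(axis=1)
    return counts / n_samples

# One column whose value frequencies match the example:
# p(1) = 0.2, p(0) = 0.3, p(2) = 0.1, p(6) = 0.4.
col = np.array([1, 1, 0, 0, 0, 2, 6, 6, 6, 6]).reshape(-1, 1)
ecdf = sketch_column_ecdf(col)
print(ecdf.ravel())
```

The comparison tensor needs memory quadratic in the number of samples, so this form is only meant for small inputs; a sort-based approach scales better.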

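pyod.utils.stat_models.ecdf_terminate_equals_inplace(matrix: ndarray, probabilities: ndarray)

Helper function for computing the ECDF of an array. It has been split out of the original function so that Numba's njit compiler can be used to speed it up, since it unfortunately has to loop over all rows and columns of the matrix. It operates in place on the probabilities matrix.

Parameters:

matrix (ndarray) – A feature matrix where the rows are samples and each column is a feature. It is expected to be sorted.
probabilities (ndarray) – A probability matrix used to build the ECDF. Its values lie between 0 and 1 and it is also sorted.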

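pyod.utils.stat_models.pairwise_distances_no_broadcast(X, Y)

Utility function to calculate the row-wise Euclidean distance between two matrices. Unlike a pairwise calculation, this function does not broadcast.

For example, if X and Y are both (4,3) matrices, the function returns a distance vector of shape (4,) instead of a (4,4) matrix.

Parameters:

X (array of shape (n_samples, n_features)) – First input samples.
Y (array of shape (n_samples, n_features)) – Second input samples.

Returns:

distance – Row-wise Euclidean distance between X and Y.

Return type:

array of shape (n_samples,)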
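
The (4,3) case mentioned above can be written out directly in NumPy. This is an illustrative sketch of the row-wise computation, not the library's code; the input values are made up for the example.

```python
import numpy as np

X = np.array([[0.0, 0.0, 0.0],
              [1.0, 1.0, 1.0],
              [2.0, 0.0, 0.0],
              [0.0, 3.0, 4.0]])
Y = np.array([[0.0, 0.0, 0.0],
              [1.0, 1.0, 2.0],
              [0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]])

# Row i of X is compared only with row i of Y, giving one distance per row.
row_dist = np.sqrt(np.sum((X - Y) ** 2, axis=1))
print(row_dist.shape, row_dist)
```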

pyod.utils.stat_models.pearsonr_mat(mat, w=None)

Utility function to calculate the Pearson correlation matrix (row-wise).

Parameters:

mat (numpy array of shape (n_samples, n_features)) – Input matrix.
w (numpy array of shape (n_features,)) – Weights.

Returns:

pear_mat – Row-wise Pearson score matrix.

Return type:

numpy array of shape (n_samples, n_samples)

pyod.utils.stat_models.wpearsonr(x, y, w=None)

Utility function to calculate the weighted Pearson correlation of two samples.

Parameters:

x (array, shape (n,)) – Input x.
y (array, shape (n,)) – Input y.
w (array, shape (n,)) – Weights w.

Returns:

scores – Weighted Pearson correlation between x and y.

Return type:

float in range of [-1,1]
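
For reference, a weighted Pearson correlation can be computed from weighted means and weighted centered sums. The sketch below uses the standard textbook formula under the assumption that missing weights default to uniform; the helper name sketch_wpearsonr is hypothetical and the code is not taken from the library.

```python
import numpy as np

def sketch_wpearsonr(x, y, w=None):
    """Hypothetical helper: weighted Pearson correlation of two 1-D samples."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Assumption: uniform weights when none are given.
    w = np.ones_like(x) if w is None else np.asarray(w, dtype=float)
    # Weighted means.
    mx = np.sum(w * x) / np.sum(w)
    my = np.sum(w * y) / np.sum(w)
    dx, dy = x - mx, y - my
    # Weighted covariance over the product of weighted standard deviations;
    # the normalizing sum of weights cancels out.
    return np.sum(w * dx * dy) / np.sqrt(np.sum(w * dx ** 2) * np.sum(w * dy ** 2))

r = sketch_wpearsonr([1, 2, 3, 4], [2, 4, 6, 8], w=[1, 1, 2, 2])
print(r)
```

With uniform weights the result coincides with the ordinary Pearson coefficient, which gives a convenient sanity check against np.corrcoef.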

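pyod.utils.utility module

A set of utility functions to support outlier detection.

pyod.utils.utility.argmaxn(value_list, n, order='desc')

Return the indices of the n largest elements in the list if order is set to 'desc'; otherwise return the indices of the n smallest elements.

Parameters:

value_list (list, array, numpy array of shape (n_samples,)) – A list containing all values.
n (int) – The number of elements to select.
order (str, optional (default='desc')) – The sort order, one of {'desc', 'asc'}: 'desc' for descending, 'asc' for ascending.

Returns:

index_list – The indices of the top n elements.

Return type:

numpy array of shape (n,)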
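
The selection described above can be illustrated with np.argsort. This is a sketch of the idea rather than the library's implementation, so the order of the indices inside each result is an artifact of this example and should not be relied on.

```python
import numpy as np

values = np.array([0.1, 0.9, 0.4, 0.7, 0.2])
n = 2

order = np.argsort(values)   # indices sorted by ascending value
largest = order[-n:]         # indices of the n largest values ('desc' case)
smallest = order[:n]         # indices of the n smallest values ('asc' case)
print(sorted(largest.tolist()), sorted(smallest.tolist()))
```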

pyod.utils.utility.check_detector(detector)

Check whether the fit and decision_function methods exist for the given detector.

Parameters:

detector (pyod.models) – The detector instance on which the check is performed.

pyod.utils.utility.check_parameter(param, low=-2147483647, high=2147483647, param_name='', include_left=False, include_right=False)

Check whether an input is within the defined range.

Parameters:

param (int, float) – The input parameter to check.
low (int, float) – The lower bound of the range.
high (int, float) – The upper bound of the range.
param_name (str, optional (default='')) – The name of the parameter.
include_left (bool, optional (default=False)) – Whether to include the lower bound (low <= param).
include_right (bool, optional (default=False)) – Whether to include the upper bound (param <= high).

Returns:

within_range – Whether the parameter is within the range (low, high).

Return type:

bool or raise errors
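
The effect of include_left and include_right on the interval can be sketched in plain Python. The helper below, sketch_in_range, is hypothetical and only returns a bool; per the return type above, the real function may raise an error instead, which this sketch does not model.

```python
def sketch_in_range(param, low, high, include_left=False, include_right=False):
    """Hypothetical helper mirroring the interval logic described above."""
    # Closed bound when the corresponding include flag is set, open otherwise.
    left_ok = param >= low if include_left else param > low
    right_ok = param <= high if include_right else param < high
    return left_ok and right_ok

# 0.5 lies on the upper bound, so the result depends on include_right.
print(sketch_in_range(0.5, 0, 0.5))
print(sketch_in_range(0.5, 0, 0.5, include_right=True))
```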

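pyod.utils.utility.generate_bagging_indices(random_state, bootstrap_features, n_features, min_features, max_features)

Randomly draw feature indices. Internal use only.

Modified from sklearn/ensemble/bagging.py.

Parameters:

random_state (RandomState) – A random number generator instance that defines the state of the random permutations generator.
bootstrap_features (bool) – Specifies whether to bootstrap the index generation.
n_features (int) – Specifies the population size when generating indices.
min_features (int) – Lower limit on the number of features to sample randomly.
max_features (int) – Upper limit on the number of features to sample randomly.

Returns:

feature_indices – Indices for feature bagging.

Return type:

numpy array, shape (n_samples,)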

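pyod.utils.utility.generate_indices(random_state, bootstrap, n_population, n_samples)

Draw randomly sampled indices. Internal use only.

See sklearn/ensemble/bagging.py.

Parameters:

random_state (RandomState) – A random number generator instance that defines the state of the random permutations generator.
bootstrap (bool) – Specifies whether to bootstrap the index generation.
n_population (int) – Specifies the population size when generating indices.
n_samples (int) – Specifies the number of samples to draw.

Returns:

indices – Randomly drawn indices.

Return type:

numpy array, shape (n_samples,)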
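
The bootstrap flag is commonly understood as the difference between sampling with and without replacement. The sketch below illustrates that distinction under this assumption with a NumPy RandomState; the helper name sketch_generate_indices is hypothetical and the code is not copied from PyOD or scikit-learn.

```python
import numpy as np

def sketch_generate_indices(rng, bootstrap, n_population, n_samples):
    """Hypothetical helper: draw indices with or without replacement."""
    if bootstrap:
        # Sampling with replacement: indices may repeat.
        return rng.randint(0, n_population, n_samples)
    # Sampling without replacement: every index is unique.
    return rng.permutation(n_population)[:n_samples]

rng = np.random.RandomState(0)
with_repl = sketch_generate_indices(rng, True, 10, 5)
without_repl = sketch_generate_indices(rng, False, 10, 5)
print(with_repl, without_repl)
```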

pyod.utils.utility.get_diff_elements(li1, li2)

Get the elements that are in li1 but not in li2, and vice versa.

Parameters:

li1 (list or numpy array) – Input list 1.
li2 (list or numpy array) – Input list 2.

Returns:

difference – The difference between li1 and li2.

Return type:

list
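
"In one list but not the other, and vice versa" is the symmetric difference of the two inputs. A plain-Python sketch of that set operation follows; it is an illustration only, and the ordering and handling of duplicates in the library's result may differ.

```python
li1 = [1, 2, 3, 4]
li2 = [3, 4, 5]

# Symmetric difference: elements found in exactly one of the two lists.
diff = sorted(set(li1) ^ set(li2))
print(diff)  # [1, 2, 5]
```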

pyod.utils.utility.get_intersection(lst1, lst2)

Get the elements that two lists have in common.

Parameters:

lst1 (list or numpy array) – Input list 1.
lst2 (list or numpy array) – Input list 2.

Returns:

intersection – The overlap between lst1 and lst2.

Return type:

list

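pyod.utils.utility.get_label_n(y, y_pred, n=None)

Function to turn raw outlier scores into binary labels by assigning 1 to the top n outlier scores.

Parameters:

y (list or numpy array of shape (n_samples,)) – The ground truth. Binary (0: inliers, 1: outliers).
y_pred (list or numpy array of shape (n_samples,)) – The raw outlier scores as returned by a fitted model.
n (int, optional (default=None)) – The number of outliers. If not defined, it is inferred from the ground truth.

Returns:

labels – Binary labels, where 0 stands for inliers and 1 for outliers.

Return type:

numpy array of shape (n_samples,)

Examples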

>>> from pyod.utils.utility import get_label_n
>>> y = [0, 1, 1, 0, 0]
>>> y_pred = [0.1, 0.5, 0.3, 0.2, 0.7]
>>> get_label_n(y, y_pred)
array([0, 1, 0, 0, 1])

pyod.utils.utility.get_list_diff(li1, li2)

Get the elements that are in li1 but not in li2, i.e. li1 - li2.

Parameters:

li1 (list or numpy array) – Input list 1.
li2 (list or numpy array) – Input list 2.

Returns:

difference – The difference between li1 and li2.

Return type:

list

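pyod.utils.utility.get_optimal_n_bins(X, upper_bound=None, epsilon=1)

Determine the optimal number of bins for a histogram using the Birgé-Rozenholc method (see [BBirgeR06] for details).

See https://doi.org/10.1051/ps:2006001

Parameters:

X (array-like of shape (n_samples, n_features)) – The samples used to determine the optimal number of bins.
upper_bound (int, default=None) – The maximum value of n_bins to consider. If set to None, np.sqrt(X.shape[0]) is used as the upper bound.
epsilon (float, default = 1) – A stabilizing term added to the logarithm to prevent division by zero.

Returns:

optimal_n_bins – The optimal value of n_bins according to the Birgé-Rozenholc method.

Return type:

int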
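
The Birgé-Rozenholc criterion picks the bin count D that maximizes the histogram log-likelihood minus the penalty D - 1 + (log D)^2.5. The following NumPy sketch shows that selection for a 1-D sample. It is an illustration under stated assumptions, not the library's code: the helper name sketch_optimal_n_bins is hypothetical, the input is flattened to 1-D, and epsilon is added inside the logarithm as the parameter description suggests.

```python
import numpy as np

def sketch_optimal_n_bins(x, upper_bound=None, epsilon=1.0):
    """Hypothetical helper: penalized-likelihood choice of histogram bins."""
    x = np.asarray(x, dtype=float).ravel()
    n = x.size
    if upper_bound is None:
        upper_bound = int(np.sqrt(n))
    best_d, best_score = 1, -np.inf
    for d in range(1, upper_bound + 1):
        counts, _ = np.histogram(x, bins=d)
        # Log-likelihood of the regular histogram with d bins; epsilon keeps
        # the logarithm finite for empty bins (assumed placement).
        log_lik = np.sum(counts * np.log(d * counts / n + epsilon))
        # Penalty from Birge and Rozenholc: d - 1 + (log d)^2.5.
        penalty = d - 1 + np.log(d) ** 2.5
        score = log_lik - penalty
        if score > best_score:
            best_d, best_score = d, score
    return best_d

rng = np.random.RandomState(0)
sample = rng.randn(400)
k = sketch_optimal_n_bins(sample)
print(k)
```

The search is bounded by the square root of the sample size when no upper_bound is given, matching the default described above.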

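pyod.utils.utility.invert_order(scores, method='multiplication')

Invert the order of a list of values. The smallest value becomes the largest in the inverted list. This is useful when combining multiple detectors, since their score orders may differ.

Parameters:

scores (list, array or numpy array with shape (n_samples,)) – The list of values to be inverted.
method (str, optional (default='multiplication')) – The method used for order inversion. Valid methods are 'multiplication' (multiply by -1) and 'subtraction' (max(scores) - scores).

Returns:

inverted_scores – The inverted list.

Return type:

numpy array of shape (n_samples,)

Examples

>>> from pyod.utils.utility import invert_order
>>> scores1 = [0.1, 0.3, 0.5, 0.7, 0.2, 0.1]
>>> invert_order(scores1)
array([-0.1, -0.3, -0.5, -0.7, -0.2, -0.1])
>>> invert_order(scores1, method='subtraction')
array([0.6, 0.4, 0.2, 0. , 0.5, 0.6])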

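pyod.utils.utility.precision_n_scores(y, y_pred, n=None)

Utility function to calculate precision @ rank n.

Parameters:

y (list or numpy array of shape (n_samples,)) – The ground truth. Binary (0: inliers, 1: outliers).
y_pred (list or numpy array of shape (n_samples,)) – The raw outlier scores as returned by a fitted model.
n (int, optional (default=None)) – The number of outliers. If not defined, it is inferred from the ground truth.

Returns:

precision_at_rank_n – Precision at rank n score.

Return type:

float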
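
Precision @ rank n is the fraction of true outliers among the n samples with the highest scores. The NumPy sketch below computes it by hand for a tiny input, inferring n from the ground truth as described above. It is an illustration, not the library's implementation, and it ignores how ties at the cut-off are handled.

```python
import numpy as np

y = np.array([0, 1, 1, 0, 0])
y_pred = np.array([0.1, 0.5, 0.3, 0.2, 0.7])

n = int(y.sum())                     # number of true outliers, here 2
top_n = np.argsort(y_pred)[-n:]      # indices of the n highest scores
precision_at_n = y[top_n].sum() / n  # share of those that are true outliers
print(precision_at_n)  # 0.5
```

Here the two highest scores belong to samples 4 and 1, of which only sample 1 is a true outlier, hence 0.5.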

pyod.utils.utility.score_to_label(pred_scores, outliers_fraction=0.1)

Turn raw outlier scores into binary labels (0 or 1).

Parameters:

pred_scores (list or numpy array of shape (n_samples,)) – Raw outlier scores. Outliers are assumed to have larger values.
outliers_fraction (float in (0,1)) – The proportion of outliers.

Returns:

outlier_labels – For each observation, indicates whether it should be considered an outlier: 0 for inliers and 1 for outliers.

Return type:

numpy array of shape (n_samples,)
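
One way to turn scores into labels for a given outlier fraction is to threshold at the matching percentile. The sketch below shows this with NumPy; the percentile-based threshold is an assumption made for illustration and is not confirmed to be the library's exact rule.

```python
import numpy as np

pred_scores = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 5.0])
outliers_fraction = 0.1

# Scores above the (1 - outliers_fraction) percentile are labeled 1.
threshold = np.percentile(pred_scores, 100 * (1 - outliers_fraction))
labels = (pred_scores > threshold).astype(int)
print(labels)
```

With ten scores and a fraction of 0.1, only the single largest score ends up above the threshold.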

pyod.utils.utility.standardizer(X, X_t=None, keep_scalar=False)

Conduct Z-normalization on the data so that the input samples have zero mean and unit variance.

Parameters:

X (numpy array of shape (n_samples, n_features)) – The training samples.
X_t (numpy array of shape (n_samples_new, n_features), optional (default=None)) – The data to be converted.
keep_scalar (bool, optional (default=False)) – Flag indicating whether to return the scaler.

Returns:

X_norm (numpy array of shape (n_samples, n_features)) – X after Z-score normalization.
X_t_norm (numpy array of shape (n_samples_new, n_features)) – X_t after Z-score normalization.
scalar (sklearn scaler object) – The scaler used in the conversion.
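
Z-normalization subtracts the per-feature mean and divides by the per-feature standard deviation. The NumPy sketch below shows the arithmetic, assuming the statistics are estimated on X only and then reused for X_t; it illustrates the transformation and is not the library's code.

```python
import numpy as np

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_t = np.array([[2.0, 40.0]])

# Statistics come from the training data only and are reused for X_t.
mean = X.mean(axis=0)
std = X.std(axis=0)  # population standard deviation (ddof=0)

X_norm = (X - mean) / std
X_t_norm = (X_t - mean) / std
print(X_norm.mean(axis=0), X_norm.std(axis=0))
print(X_t_norm)
```

Reusing the training statistics keeps the new data on the same scale as the training data, so X_t_norm is generally not zero-mean on its own.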