Examples

kNN Example

Full example: knn_example.py

  1. Import models

    from pyod.models.knn import KNN   # kNN detector
    
  2. Generate sample data with pyod.utils.data.generate_data():

    from pyod.utils.data import generate_data

    contamination = 0.1  # fraction of outliers
    n_train = 200  # number of training points
    n_test = 100  # number of testing points

    X_train, X_test, y_train, y_test = generate_data(
        n_train=n_train, n_test=n_test, contamination=contamination)
    
  3. Initialize a pyod.models.knn.KNN detector, fit the model, and make predictions.

    # train kNN detector
    clf_name = 'KNN'
    clf = KNN()
    clf.fit(X_train)
    
    # get the prediction labels and outlier scores of the training data
    y_train_pred = clf.labels_  # binary labels (0: inliers, 1: outliers)
    y_train_scores = clf.decision_scores_  # raw outlier scores
    
    # get the prediction on the test data
    y_test_pred = clf.predict(X_test)  # outlier labels (0 or 1)
    y_test_scores = clf.decision_function(X_test)  # outlier scores
    
    # it is possible to get the prediction confidence as well:
    # outlier labels (0 or 1) and confidence in the range of [0, 1]
    y_test_pred, y_test_pred_confidence = clf.predict(
        X_test, return_confidence=True)
    
  4. Evaluate the predictions by ROC and Precision @ Rank n using pyod.utils.data.evaluate_print():

    from pyod.utils.data import evaluate_print
    # evaluate and print the results
    print("\nOn Training Data:")
    evaluate_print(clf_name, y_train, y_train_scores)
    print("\nOn Test Data:")
    evaluate_print(clf_name, y_test, y_test_scores)
    
  5. See the sample output on both the training and test data.

    On Training Data:
    KNN ROC:1.0, precision @ rank n:1.0
    
    On Test Data:
    KNN ROC:0.9989, precision @ rank n:0.9
    
  6. Generate the visualization with the visualize function included in all examples.

    from pyod.utils.example import visualize

    visualize(clf_name, X_train, y_train, X_test, y_test, y_train_pred,
              y_test_pred, show_figure=True, save_figure=False)
    
Figure: kNN demo
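
To make the quantities in step 3 more concrete, the following is a minimal NumPy sketch of distance-based kNN scoring and of how a contamination rate can turn raw scores into binary labels. It is not the library's implementation: the helper name knn_outlier_scores, the synthetic data, and the percentile-based threshold are assumptions made for illustration only.

```python
import numpy as np


def knn_outlier_scores(X_ref, X_query, k=5, exclude_self=False):
    """Score each query point by its distance to its k-th nearest reference point.

    Hypothetical helper for illustration; larger scores mean "more outlying".
    """
    # Pairwise Euclidean distances, shape (n_query, n_ref)
    diff = X_query[:, None, :] - X_ref[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    dist.sort(axis=1)
    # When the reference set is scored against itself, column 0 holds the
    # zero distance of each point to itself, so skip it.
    offset = 1 if exclude_self else 0
    return dist[:, k - 1 + offset]


# Synthetic data (assumption): 90 clustered inliers and 10 far-away outliers
rng = np.random.default_rng(0)
inliers = rng.normal(loc=0.0, scale=1.0, size=(90, 2))
outliers = rng.uniform(low=6.0, high=8.0, size=(10, 2))
X_train = np.vstack([inliers, outliers])

contamination = 0.1  # assumed fraction of outliers
train_scores = knn_outlier_scores(X_train, X_train, k=5, exclude_self=True)

# Flag roughly the top `contamination` fraction of training scores
threshold = np.percentile(train_scores, 100 * (1 - contamination))
train_labels = (train_scores > threshold).astype(int)

# New points are scored against the training set and reuse the same threshold
X_new = np.array([[0.0, 0.0], [20.0, 20.0]])
new_scores = knn_outlier_scores(X_train, X_new, k=5)
new_labels = (new_scores > threshold).astype(int)
```

Under these assumptions, train_scores and train_labels play the roles of decision_scores_ and labels_, while new_scores and new_labels correspond to what decision_function and predict return for unseen data.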

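Model Combination Example

Because outlier detection is unsupervised, it often suffers from model instability. It is therefore recommended to combine the outputs of several detectors, for example by averaging, to improve robustness. Detector combination is a subfield of outlier ensembles; refer to [BKalayciE18] for more information.

This demo shows four score combination mechanisms:

  1. Average: the average score of all detectors.

  2. Maximization: the maximum score across all detectors.

  3. Average of Maximum (AOM): divide the base detectors into subgroups and take the maximum score of each subgroup. The final score is the average of all subgroup scores.

  4. Maximum of Average (MOA): divide the base detectors into subgroups and take the average score of each subgroup. The final score is the maximum of all subgroup scores.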
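
The four mechanisms above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the library's code: the function names are hypothetical, the score matrix is made up, and the detectors are split into contiguous subgroups, which is an assumption (a library may assign detectors to subgroups differently, for example at random).

```python
import numpy as np


def combine_average(scores):
    """Average score across all detectors (columns)."""
    return scores.mean(axis=1)


def combine_maximization(scores):
    """Maximum score across all detectors (columns)."""
    return scores.max(axis=1)


def _split_into_groups(scores, n_groups):
    # Assumption of this sketch: contiguous column groups of near-equal size
    return np.array_split(scores, n_groups, axis=1)


def combine_aom(scores, n_groups):
    """Average of Maximum: max within each subgroup, then mean over subgroups."""
    group_max = [g.max(axis=1) for g in _split_into_groups(scores, n_groups)]
    return np.mean(group_max, axis=0)


def combine_moa(scores, n_groups):
    """Maximum of Average: mean within each subgroup, then max over subgroups."""
    group_mean = [g.mean(axis=1) for g in _split_into_groups(scores, n_groups)]
    return np.max(group_mean, axis=0)


# Two samples scored by four detectors; scores are assumed to be on a common scale
scores = np.array([[1.0, 3.0, 2.0, 6.0],
                   [0.0, 4.0, 5.0, 1.0]])

avg_scores = combine_average(scores)        # [3.0, 2.5]
max_scores = combine_maximization(scores)   # [6.0, 5.0]
aom_scores = combine_aom(scores, 2)         # [4.5, 4.5]
moa_scores = combine_moa(scores, 2)         # [4.0, 3.0]
```

With two subgroups (detectors 0-1 and 2-3), the first sample has subgroup maxima 3 and 6, so its AOM score is 4.5, and subgroup means 2 and 4, so its MOA score is 4.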

"examples/comb_example.py" illustrates the API for combining the output of multiple base detectors (comb_example.py, Jupyter Notebooks). For Jupyter Notebooks, please navigate to "/notebooks/Model Combination.ipynb".

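  1. Import models and generate sample data.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from pyod.models.knn import KNN  # kNN detector
    from pyod.models.combination import aom, moa, average, maximization
    from pyod.utils.data import generate_data
    from pyod.utils.utility import standardizer

    X, y = generate_data(train_only=True)  # load data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)  # 60/40 split
    X_train_norm, X_test_norm = standardizer(X_train, X_test)  # standardize features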
    
  2. Initialize 20 kNN outlier detectors with different values of k (10 to 200), and collect their outlier scores.

    # initialize 20 base detectors for combination
    k_list = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140,
                150, 160, 170, 180, 190, 200]
    n_clf = len(k_list) # Number of classifiers being trained
    
    train_scores = np.zeros([X_train.shape[0], n_clf])
    test_scores = np.zeros([X_test.shape[0], n_clf])
    
    for i in range(n_clf):
        k = k_list[i]
    
        clf = KNN(n_neighbors=k, method='largest')
        clf.fit(X_train_norm)
    
        train_scores[:, i] = clf.decision_scores_
        test_scores[:, i] = clf.decision_function(X_test_norm)
    
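  3. The output scores are then standardized to zero mean and unit standard deviation before combination. This step is crucial for bringing the detector outputs onto the same scale.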

    from pyod.utils.utility import standardizer
    
    # scores have to be normalized before combination
    train_scores_norm, test_scores_norm = standardizer(train_scores, test_scores)
    
  4. Four different combination algorithms are applied, as described above:

    comb_by_average = average(test_scores_norm)
    comb_by_maximization = maximization(test_scores_norm)
    comb_by_aom = aom(test_scores_norm, 5) # 5 groups
    comb_by_moa = moa(test_scores_norm, 5) # 5 groups
    
  5. Finally, all four combination methods are evaluated by ROC and Precision @ Rank n:

    Combining 20 kNN detectors
    Combination by Average ROC:0.9194, precision @ rank n:0.4531
    Combination by Maximization ROC:0.9198, precision @ rank n:0.4688
    Combination by AOM ROC:0.9257, precision @ rank n:0.4844
    Combination by MOA ROC:0.9263, precision @ rank n:0.4688
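
Both walkthroughs above report ROC and Precision @ Rank n. As a rough guide to what these numbers mean, here is a small NumPy sketch of the two metrics. The function names and the toy labels and scores are hypothetical, and the definitions below (ROC as a pairwise ranking probability, precision over the top n scores with n equal to the number of true outliers) are common textbook forms; a library may compute them slightly differently, for instance in how it handles ties.

```python
import numpy as np


def roc_auc(y_true, scores):
    """Probability that a random outlier outscores a random inlier (ties count 0.5)."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]  # scores of true outliers
    neg = scores[y_true == 0]  # scores of true inliers
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))


def precision_at_rank_n(y_true, scores, n=None):
    """Fraction of true outliers among the n highest-scoring samples."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    if n is None:
        n = int(y_true.sum())  # default: n = number of true outliers
    top_n = np.argsort(-scores, kind="stable")[:n]
    return y_true[top_n].mean()


# Toy example: 2 outliers among 6 samples
y_true = [0, 0, 1, 0, 1, 0]
scores = [0.1, 0.4, 0.9, 0.8, 0.3, 0.2]

auc = roc_auc(y_true, scores)                 # 6 of 8 outlier/inlier pairs ranked correctly
p_at_n = precision_at_rank_n(y_true, scores)  # top 2 scores contain 1 true outlier
```

In this toy case the ROC value is 0.75 and the precision at rank n is 0.5: one outlier is ranked first, but the other is outscored by two inliers.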
    

Thresholding Example

Full example: threshold_example.py

  1. Import models

    from pyod.models.knn import KNN   # kNN detector
    from pyod.models.thresholds import FILTER  # Filter thresholder
    
  2. Generate sample data with pyod.utils.data.generate_data():

    from pyod.utils.data import generate_data

    contamination = 0.1  # fraction of outliers
    n_train = 200  # number of training points
    n_test = 100  # number of testing points

    X_train, X_test, y_train, y_test = generate_data(
        n_train=n_train, n_test=n_test, contamination=contamination)
    
  3. Initialize a pyod.models.knn.KNN detector, fit the model, and make predictions.

    # train kNN detector and apply FILTER thresholding
    clf_name = 'KNN'
    clf = KNN(contamination=FILTER())
    clf.fit(X_train)
    
    # get the prediction labels and outlier scores of the training data
    y_train_pred = clf.labels_  # binary labels (0: inliers, 1: outliers)
    y_train_scores = clf.decision_scores_  # raw outlier scores
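
Passing a thresholder object as contamination means the cut-off between inliers and outliers is derived from the scores themselves rather than from a preset fraction. The sketch below illustrates that general idea only. It does not reproduce the FILTER method; it swaps in a much simpler interquartile-range rule, and the function name and toy scores are hypothetical.

```python
import numpy as np


def iqr_threshold_labels(scores):
    """Label scores above Q3 + 1.5 * IQR as outliers (stand-in rule, not FILTER)."""
    scores = np.asarray(scores, dtype=float)
    q1, q3 = np.percentile(scores, [25, 75])
    threshold = q3 + 1.5 * (q3 - q1)
    labels = (scores > threshold).astype(int)
    return labels, threshold


# Toy outlier scores: eight typical values and two clearly larger ones
scores = np.array([1.0, 1.2, 0.9, 1.1, 1.0, 1.3, 0.8, 1.1, 9.0, 12.0])
labels, threshold = iqr_threshold_labels(scores)
n_flagged = int(labels.sum())  # number of flagged points is decided by the data
```

Here the rule flags the two large scores without being told in advance that 20% of the points are outliers, which is the practical difference from a fixed contamination value.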
    

References

[BKalayciE18] İlker Kalaycı and Tuncay Ercan. Anomaly detection in wireless sensor networks data by using histogram based outlier score method. In 2018 2nd International Symposium on Multidisciplinary Studies and Innovative Technologies (ISMSIT), 1–6. IEEE, 2018.