.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/bicluster/plot_bicluster_newsgroups.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end ` to download the full example code
        or to run this example in your browser via Binder.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_bicluster_plot_bicluster_newsgroups.py:


================================================================
Biclustering documents with the Spectral Co-clustering algorithm
================================================================

This example demonstrates the Spectral Co-clustering algorithm on the
twenty newsgroups dataset. The 'comp.os.ms-windows.misc' category is
excluded because it contains many posts that consist of nothing but data.

The TF-IDF vectorized posts form a word frequency matrix, which is then
biclustered using Dhillon's Spectral Co-Clustering algorithm. The resulting
document-word biclusters indicate subsets of words that are used more often
in the corresponding subsets of documents.

For a few of the best biclusters, the most common document categories and
the ten most important words are printed. The best biclusters are those with
the lowest normalized cut. The most important words are found by comparing
each word's summed weight inside the bicluster with its summed weight
outside it.

For comparison, the documents are also clustered using MiniBatchKMeans. The
document clusters derived from the biclusters achieve a better V-measure
than the clusters found by MiniBatchKMeans.

.. GENERATED FROM PYTHON SOURCE LINES 15-161

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Vectorizing...
    Coclustering...
    Done in 1.49s. V-measure: 0.4415
    MiniBatchKMeans...
    Done in 1.07s. V-measure: 0.3015

    Best biclusters:
    ----------------
    bicluster 0 : 8 documents, 6 words
    categories : 100% talk.politics.mideast
    words : cosmo, angmar, alfalfa, alphalpha, proline, benson

    bicluster 1 : 1948 documents, 4325 words
    categories : 23% talk.politics.guns, 18% talk.politics.misc, 17% sci.med
    words : gun, guns, geb, banks, gordon, clinton, pitt, cdt, surrender, veal

    bicluster 2 : 1259 documents, 3534 words
    categories : 27% soc.religion.christian, 25% talk.politics.mideast, 25% alt.atheism
    words : god, jesus, christians, kent, sin, objective, belief, christ, faith, moral

    bicluster 3 : 775 documents, 1623 words
    categories : 30% comp.windows.x, 25% comp.sys.ibm.pc.hardware, 20% comp.graphics
    words : scsi, nada, ide, vga, esdi, isa, kth, s3, vlb, bmug

    bicluster 4 : 2180 documents, 2802 words
    categories : 18% comp.sys.mac.hardware, 16% sci.electronics, 16% comp.sys.ibm.pc.hardware
    words : voltage, shipping, circuit, receiver, processing, scope, mpce, analog, kolstad, umass






|

.. code-block:: Python

    import operator
    from collections import defaultdict
    from time import time

    import numpy as np

    from sklearn.cluster import MiniBatchKMeans, SpectralCoclustering
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.cluster import v_measure_score


    def number_normalizer(tokens):
        """Map all numeric tokens to a placeholder.

        For many applications, tokens that begin with a number are not
        directly useful, but the fact that such a token exists can be
        relevant. By applying this form of dimensionality reduction, some
        methods may perform better.
        """
        return ("#NUMBER" if token[0].isdigit() else token for token in tokens)


    class NumberNormalizingVectorizer(TfidfVectorizer):
        def build_tokenizer(self):
            tokenize = super().build_tokenizer()
            return lambda doc: list(number_normalizer(tokenize(doc)))


    # exclude 'comp.os.ms-windows.misc'
    categories = [
        "alt.atheism",
        "comp.graphics",
        "comp.sys.ibm.pc.hardware",
        "comp.sys.mac.hardware",
        "comp.windows.x",
        "misc.forsale",
        "rec.autos",
        "rec.motorcycles",
        "rec.sport.baseball",
        "rec.sport.hockey",
        "sci.crypt",
        "sci.electronics",
        "sci.med",
        "sci.space",
        "soc.religion.christian",
        "talk.politics.guns",
        "talk.politics.mideast",
        "talk.politics.misc",
        "talk.religion.misc",
    ]
    newsgroups = fetch_20newsgroups(categories=categories)
    y_true = newsgroups.target

    vectorizer = NumberNormalizingVectorizer(stop_words="english", min_df=5)
    cocluster = SpectralCoclustering(
        n_clusters=len(categories), svd_method="arpack", random_state=0
    )
    kmeans = MiniBatchKMeans(
        n_clusters=len(categories), batch_size=20000, random_state=0, n_init=3
    )

    print("Vectorizing...")
    X = vectorizer.fit_transform(newsgroups.data)

    print("Coclustering...")
    start_time = time()
    cocluster.fit(X)
    y_cocluster = cocluster.row_labels_
    print(
        "Done in {:.2f}s. V-measure: {:.4f}".format(
            time() - start_time, v_measure_score(y_cocluster, y_true)
        )
    )

    print("MiniBatchKMeans...")
    start_time = time()
    y_kmeans = kmeans.fit_predict(X)
    print(
        "Done in {:.2f}s. V-measure: {:.4f}".format(
            time() - start_time, v_measure_score(y_kmeans, y_true)
        )
    )

    feature_names = vectorizer.get_feature_names_out()
    document_names = list(newsgroups.target_names[i] for i in newsgroups.target)


    def bicluster_ncut(i):
        rows, cols = cocluster.get_indices(i)
        if not (np.any(rows) and np.any(cols)):
            import sys

            return sys.float_info.max
        row_complement = np.nonzero(np.logical_not(cocluster.rows_[i]))[0]
        col_complement = np.nonzero(np.logical_not(cocluster.columns_[i]))[0]
        # Note: the following is identical to X[rows[:, np.newaxis],
        # cols].sum() but much faster in scipy <= 0.16
        weight = X[rows][:, cols].sum()
        cut = X[row_complement][:, cols].sum() + X[rows][:, col_complement].sum()
        return cut / weight


    def most_common(d):
        """Items of a defaultdict(int) with the highest values.

        Like Counter.most_common in Python >=2.7.
        """
        return sorted(d.items(), key=operator.itemgetter(1), reverse=True)


    bicluster_ncuts = list(bicluster_ncut(i) for i in range(len(newsgroups.target_names)))
    best_idx = np.argsort(bicluster_ncuts)[:5]

    print()
    print("Best biclusters:")
    print("----------------")
    for idx, cluster in enumerate(best_idx):
        n_rows, n_cols = cocluster.get_shape(cluster)
        cluster_docs, cluster_words = cocluster.get_indices(cluster)
        if not len(cluster_docs) or not len(cluster_words):
            continue

        # categories
        counter = defaultdict(int)
        for i in cluster_docs:
            counter[document_names[i]] += 1
        cat_string = ", ".join(
            "{:.0f}% {}".format(float(c) / n_rows * 100, name)
            for name, c in most_common(counter)[:3]
        )

        # words
        out_of_cluster_docs = cocluster.row_labels_ != cluster
        out_of_cluster_docs = np.where(out_of_cluster_docs)[0]
        word_col = X[:, cluster_words]
        word_scores = np.array(
            word_col[cluster_docs, :].sum(axis=0)
            - word_col[out_of_cluster_docs, :].sum(axis=0)
        )
        word_scores = word_scores.ravel()
        important_words = list(
            feature_names[cluster_words[i]] for i in word_scores.argsort()[:-11:-1]
        )

        print("bicluster {} : {} documents, {} words".format(idx, n_rows, n_cols))
        print("categories : {}".format(cat_string))
        print("words : {}\n".format(", ".join(important_words)))


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 4.139 seconds)


.. _sphx_glr_download_auto_examples_bicluster_plot_bicluster_newsgroups.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: binder-badge

      .. image:: images/binder_badge_logo.svg
        :target: https://mybinder.org/v2/gh/scikit-learn/scikit-learn/main?urlpath=lab/tree/notebooks/auto_examples/bicluster/plot_bicluster_newsgroups.ipynb
        :alt: Launch binder
        :width: 150 px

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_bicluster_newsgroups.ipynb `

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_bicluster_newsgroups.py `

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: plot_bicluster_newsgroups.zip `


.. include:: plot_bicluster_newsgroups.recommendations


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery `_
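

The normalized-cut ranking used in the example can be illustrated on a small
dense matrix. The following is a minimal sketch, not the script's own code:
the toy matrix ``X_toy`` and the helper ``toy_ncut`` are hypothetical names
introduced for illustration, and the sketch assumes NumPy is available and
that the bicluster has nonzero internal weight. It computes the same ratio as
``bicluster_ncut``: the weight that crosses the bicluster boundary divided by
the weight inside the bicluster.

.. code-block:: Python

    import numpy as np


    def toy_ncut(X, rows, cols):
        """Ratio of weight leaving a bicluster to the weight inside it.

        Hypothetical helper for illustration; assumes the internal weight
        is nonzero.
        """
        rows = np.asarray(rows)
        cols = np.asarray(cols)
        # Complements: every row/column index that is not in the bicluster.
        row_complement = np.setdiff1d(np.arange(X.shape[0]), rows)
        col_complement = np.setdiff1d(np.arange(X.shape[1]), cols)
        # Weight inside the bicluster.
        weight = X[np.ix_(rows, cols)].sum()
        # Weight shared between the bicluster and the rest of the matrix.
        cut = (
            X[np.ix_(row_complement, cols)].sum()
            + X[np.ix_(rows, col_complement)].sum()
        )
        return cut / weight


    # Toy document-word matrix: rows 0-1 mostly use words 0-1,
    # row 2 mostly uses words 2-3.
    X_toy = np.array(
        [
            [4.0, 4.0, 0.0, 1.0],
            [4.0, 4.0, 0.0, 0.0],
            [0.0, 1.0, 3.0, 3.0],
        ]
    )

    tight = toy_ncut(X_toy, rows=[0, 1], cols=[0, 1])  # well-separated block
    loose = toy_ncut(X_toy, rows=[0, 1], cols=[2, 3])  # poorly matched block
    print(tight, loose)

In this toy setting the well-separated block gets the smaller score, which is
why the example sorts biclusters by ascending normalized cut and keeps the
first five.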