.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/applications/plot_topics_extraction_with_nmf_lda.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_applications_plot_topics_extraction_with_nmf_lda.py>`
        to download the full example code or to run this example in your browser via Binder.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_applications_plot_topics_extraction_with_nmf_lda.py:

=======================================================================================
Topic extraction with Non-negative Matrix Factorization and Latent Dirichlet Allocation
=======================================================================================

This is an example of applying :class:`~sklearn.decomposition.NMF` and
:class:`~sklearn.decomposition.LatentDirichletAllocation` on a corpus of
documents and extracting additive models of the topic structure of the corpus.
The output is a plot of topics, each represented as a bar plot using the top
few words based on weights.

Non-negative Matrix Factorization is applied with two different objective
functions: the Frobenius norm and the generalized Kullback-Leibler divergence.
The latter is equivalent to Probabilistic Latent Semantic Indexing.

The default parameters (n_samples / n_features / n_components) should make the
example runnable in a couple of tens of seconds. You can try to increase the
dimensions of the problem, but be aware that the time complexity is polynomial
in NMF. In LDA, the time complexity is proportional to (n_samples * iterations).
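For reference, the two NMF objective functions compared below can be written
explicitly (a standard formulation, with :math:`X` the document-term matrix
and :math:`W, H \geq 0` the learned factors). The Frobenius norm objective is

.. math::

    \min_{W, H \geq 0} \frac{1}{2} \|X - WH\|_{F}^{2}
    = \frac{1}{2} \sum_{i,j} \left( X_{ij} - (WH)_{ij} \right)^{2},

while the generalized Kullback-Leibler divergence objective is

.. math::

    \min_{W, H \geq 0} \sum_{i,j} \left( X_{ij} \log \frac{X_{ij}}{(WH)_{ij}}
    - X_{ij} + (WH)_{ij} \right).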
.. GENERATED FROM PYTHON SOURCE LINES 13-212

.. rst-class:: sphx-glr-horizontal


    *

      .. image-sg:: /auto_examples/applications/images/sphx_glr_plot_topics_extraction_with_nmf_lda_001.png
         :alt: Topics in NMF model (Frobenius norm), Topic 1, Topic 2, Topic 3, Topic 4, Topic 5, Topic 6, Topic 7, Topic 8, Topic 9, Topic 10
         :srcset: /auto_examples/applications/images/sphx_glr_plot_topics_extraction_with_nmf_lda_001.png
         :class: sphx-glr-multi-img

    *

      .. image-sg:: /auto_examples/applications/images/sphx_glr_plot_topics_extraction_with_nmf_lda_002.png
         :alt: Topics in NMF model (generalized Kullback-Leibler divergence), Topic 1, Topic 2, Topic 3, Topic 4, Topic 5, Topic 6, Topic 7, Topic 8, Topic 9, Topic 10
         :srcset: /auto_examples/applications/images/sphx_glr_plot_topics_extraction_with_nmf_lda_002.png
         :class: sphx-glr-multi-img

    *

      .. image-sg:: /auto_examples/applications/images/sphx_glr_plot_topics_extraction_with_nmf_lda_003.png
         :alt: Topics in MiniBatchNMF model (Frobenius norm), Topic 1, Topic 2, Topic 3, Topic 4, Topic 5, Topic 6, Topic 7, Topic 8, Topic 9, Topic 10
         :srcset: /auto_examples/applications/images/sphx_glr_plot_topics_extraction_with_nmf_lda_003.png
         :class: sphx-glr-multi-img

    *

      .. image-sg:: /auto_examples/applications/images/sphx_glr_plot_topics_extraction_with_nmf_lda_004.png
         :alt: Topics in MiniBatchNMF model (generalized Kullback-Leibler divergence), Topic 1, Topic 2, Topic 3, Topic 4, Topic 5, Topic 6, Topic 7, Topic 8, Topic 9, Topic 10
         :srcset: /auto_examples/applications/images/sphx_glr_plot_topics_extraction_with_nmf_lda_004.png
         :class: sphx-glr-multi-img

    *

      .. image-sg:: /auto_examples/applications/images/sphx_glr_plot_topics_extraction_with_nmf_lda_005.png
         :alt: Topics in LDA model, Topic 1, Topic 2, Topic 3, Topic 4, Topic 5, Topic 6, Topic 7, Topic 8, Topic 9, Topic 10
         :srcset: /auto_examples/applications/images/sphx_glr_plot_topics_extraction_with_nmf_lda_005.png
         :class: sphx-glr-multi-img

.. rst-class:: sphx-glr-script-out

 .. code-block:: none

    Loading dataset...
    done in 0.704s.
    Extracting tf-idf features for NMF...
    done in 0.146s.
    Extracting tf features for LDA...
    done in 0.133s.
    Fitting the NMF model (Frobenius norm) with tf-idf features, n_samples=2000 and n_features=1000...
    done in 0.110s.
    Fitting the NMF model (generalized Kullback-Leibler divergence) with tf-idf features, n_samples=2000 and n_features=1000...
    done in 0.551s.
    Fitting the MiniBatchNMF model (Frobenius norm) with tf-idf features, n_samples=2000 and n_features=1000, batch_size=128...
    done in 0.062s.
    Fitting the MiniBatchNMF model (generalized Kullback-Leibler divergence) with tf-idf features, n_samples=2000 and n_features=1000, batch_size=128...
    done in 0.098s.
    Fitting LDA models with tf features, n_samples=2000 and n_features=1000...
    done in 1.496s.


|

.. code-block:: Python

    # Authors: The scikit-learn developers
    # SPDX-License-Identifier: BSD-3-Clause

    from time import time

    import matplotlib.pyplot as plt

    from sklearn.datasets import fetch_20newsgroups
    from sklearn.decomposition import NMF, LatentDirichletAllocation, MiniBatchNMF
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    n_samples = 2000
    n_features = 1000
    n_components = 10
    n_top_words = 20
    batch_size = 128
    init = "nndsvda"


    def plot_top_words(model, feature_names, n_top_words, title):
        fig, axes = plt.subplots(2, 5, figsize=(30, 15), sharex=True)
        axes = axes.flatten()
        for topic_idx, topic in enumerate(model.components_):
            top_features_ind = topic.argsort()[-n_top_words:]
            top_features = feature_names[top_features_ind]
            weights = topic[top_features_ind]

            ax = axes[topic_idx]
            ax.barh(top_features, weights, height=0.7)
            ax.set_title(f"Topic {topic_idx + 1}", fontdict={"fontsize": 30})
            ax.tick_params(axis="both", which="major", labelsize=20)
            for i in "top right left".split():
                ax.spines[i].set_visible(False)
            fig.suptitle(title, fontsize=40)

        plt.subplots_adjust(top=0.90, bottom=0.05, wspace=0.90, hspace=0.3)
        plt.show()


    # Load the 20 newsgroups dataset and vectorize it. We use a few heuristics
    # to filter out useless terms early on: the posts are stripped of headers,
    # footers and quoted replies, and common English words, words occurring in
    # only one document or in at least 95% of the documents are removed.

    print("Loading dataset...")
    t0 = time()
    data, _ = fetch_20newsgroups(
        shuffle=True,
        random_state=1,
        remove=("headers", "footers", "quotes"),
        return_X_y=True,
    )
    data_samples = data[:n_samples]
    print("done in %0.3fs." % (time() - t0))

    # Use tf-idf features for NMF.
    print("Extracting tf-idf features for NMF...")
    tfidf_vectorizer = TfidfVectorizer(
        max_df=0.95, min_df=2, max_features=n_features, stop_words="english"
    )
    t0 = time()
    tfidf = tfidf_vectorizer.fit_transform(data_samples)
    print("done in %0.3fs." % (time() - t0))

    # Use tf (raw term count) features for LDA.
    print("Extracting tf features for LDA...")
    tf_vectorizer = CountVectorizer(
        max_df=0.95, min_df=2, max_features=n_features, stop_words="english"
    )
    t0 = time()
    tf = tf_vectorizer.fit_transform(data_samples)
    print("done in %0.3fs." % (time() - t0))
    print()

    # Fit the NMF model with the Frobenius norm
    print(
        "Fitting the NMF model (Frobenius norm) with tf-idf features, "
        "n_samples=%d and n_features=%d..." % (n_samples, n_features)
    )
    t0 = time()
    nmf = NMF(
        n_components=n_components,
        random_state=1,
        init=init,
        beta_loss="frobenius",
        alpha_W=0.00005,
        alpha_H=0.00005,
        l1_ratio=1,
    ).fit(tfidf)
    print("done in %0.3fs." % (time() - t0))

    tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()
    plot_top_words(
        nmf, tfidf_feature_names, n_top_words, "Topics in NMF model (Frobenius norm)"
    )

    # Fit the NMF model with the generalized Kullback-Leibler divergence
    print(
        "\n" * 2,
        "Fitting the NMF model (generalized Kullback-Leibler "
        "divergence) with tf-idf features, n_samples=%d and n_features=%d..."
        % (n_samples, n_features),
    )
    t0 = time()
    nmf = NMF(
        n_components=n_components,
        random_state=1,
        init=init,
        beta_loss="kullback-leibler",
        solver="mu",
        max_iter=1000,
        alpha_W=0.00005,
        alpha_H=0.00005,
        l1_ratio=0.5,
    ).fit(tfidf)
    print("done in %0.3fs." % (time() - t0))

    tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()
    plot_top_words(
        nmf,
        tfidf_feature_names,
        n_top_words,
        "Topics in NMF model (generalized Kullback-Leibler divergence)",
    )

    # Fit the MiniBatchNMF model with the Frobenius norm
    print(
        "\n" * 2,
        "Fitting the MiniBatchNMF model (Frobenius norm) with tf-idf "
        "features, n_samples=%d and n_features=%d, batch_size=%d..."
        % (n_samples, n_features, batch_size),
    )
    t0 = time()
    mbnmf = MiniBatchNMF(
        n_components=n_components,
        random_state=1,
        batch_size=batch_size,
        init=init,
        beta_loss="frobenius",
        alpha_W=0.00005,
        alpha_H=0.00005,
        l1_ratio=0.5,
    ).fit(tfidf)
    print("done in %0.3fs." % (time() - t0))

    tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()
    plot_top_words(
        mbnmf,
        tfidf_feature_names,
        n_top_words,
        "Topics in MiniBatchNMF model (Frobenius norm)",
    )

    # Fit the MiniBatchNMF model with the generalized Kullback-Leibler divergence
    print(
        "\n" * 2,
        "Fitting the MiniBatchNMF model (generalized Kullback-Leibler "
        "divergence) with tf-idf features, n_samples=%d and n_features=%d, "
        "batch_size=%d..." % (n_samples, n_features, batch_size),
    )
    t0 = time()
    mbnmf = MiniBatchNMF(
        n_components=n_components,
        random_state=1,
        batch_size=batch_size,
        init=init,
        beta_loss="kullback-leibler",
        alpha_W=0.00005,
        alpha_H=0.00005,
        l1_ratio=0.5,
    ).fit(tfidf)
    print("done in %0.3fs." % (time() - t0))

    tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()
    plot_top_words(
        mbnmf,
        tfidf_feature_names,
        n_top_words,
        "Topics in MiniBatchNMF model (generalized Kullback-Leibler divergence)",
    )

    # Fit the LDA model on raw term counts
    print(
        "\n" * 2,
        "Fitting LDA models with tf features, n_samples=%d and n_features=%d..."
        % (n_samples, n_features),
    )
    lda = LatentDirichletAllocation(
        n_components=n_components,
        max_iter=5,
        learning_method="online",
        learning_offset=50.0,
        random_state=0,
    )
    t0 = time()
    lda.fit(tf)
    print("done in %0.3fs." % (time() - t0))

    tf_feature_names = tf_vectorizer.get_feature_names_out()
    plot_top_words(lda, tf_feature_names, n_top_words, "Topics in LDA model")
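The fitted factorizations can also be reused on unseen text. Below is a
minimal sketch (not part of the generated example) assuming the ``lda``,
``nmf``, ``tf_vectorizer`` and ``tfidf_vectorizer`` objects from the script
above are still in scope; the document string is made up for illustration.

.. code-block:: Python

    # Hypothetical held-out document, for illustration only.
    new_doc = ["The graphics card renders 3d images on the screen"]

    # LDA topic distribution: each row of the `transform` output sums to one,
    # so it can be read as per-document topic proportions.
    doc_topics_lda = lda.transform(tf_vectorizer.transform(new_doc))
    print("LDA topic distribution:", doc_topics_lda.round(3))

    # NMF topic activations: non-negative weights on the learned components;
    # unlike LDA, the rows are not normalized to a probability distribution.
    doc_topics_nmf = nmf.transform(tfidf_vectorizer.transform(new_doc))
    print("NMF topic activations:", doc_topics_nmf.round(3))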
.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 5.812 seconds)


.. _sphx_glr_download_auto_examples_applications_plot_topics_extraction_with_nmf_lda.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: binder-badge

      .. image:: images/binder_badge_logo.svg
        :target: https://mybinder.org/v2/gh/scikit-learn/scikit-learn/main?urlpath=lab/tree/notebooks/auto_examples/applications/plot_topics_extraction_with_nmf_lda.ipynb
        :alt: Launch binder
        :width: 150 px

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_topics_extraction_with_nmf_lda.ipynb <plot_topics_extraction_with_nmf_lda.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_topics_extraction_with_nmf_lda.py <plot_topics_extraction_with_nmf_lda.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: plot_topics_extraction_with_nmf_lda.zip <plot_topics_extraction_with_nmf_lda.zip>`

.. include:: plot_topics_extraction_with_nmf_lda.recommendations

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_