【NLP】LDA模型与实现

2020年04月11日

Author: Guofei

文章归类: 2-4-NLP ，文章编号: 341

版权声明：本文作者是郭飞。转载随意，但需要标明原文链接，并通知本人
原文链接：https://www.guofei.site/2020/04/11/lda.html

LDA（Latent Dirichlet Allocation）是一个关于NLP的模型。区别于另一个LDA(Linear Discrimination Analysis) 是一种有监督降维模型。

先做 CountVectorizer

count_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=n_features,
                                stop_words='english')

X_cnt_vec = count_vectorizer.fit_transform(data_samples)

然后lda

lda = LatentDirichletAllocation(n_components=n_components, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)
lda.fit(X_cnt_vec)

lda.components_array # array, [n_components, n_features]，意义：components_[i, j]值得是 topic i和 word j 的关系强度。

n_components: 主题个数
doc_topic_prior:先验Dirichlet分布的参数θθ，默认1/n_components
topic_word_prior:先验Dirichlet分布的参数ββ，默认1/n_components
learning_method: 即LDA的求解算法。有 ‘batch’ 和 ‘online’两种选择。 ‘batch’即我们在原理篇讲的变分推断EM算法，而”online”即在线变分推断EM算法，在”batch”的基础上引入了分步训练，将训练样本分批，逐步一批批的用样本更新主题词分布的算法。默认是”batch”。如果数据量少，用”batch”，需要调参少。如果数据量多，用 “online” ，速度较快。
learning_decay：仅仅在算法使用”online”时有意义，取值最好在(0.5, 1.0]
learning_offset：仅仅在算法使用”online”时有意义，取值要大于1。用来减小前面训练样本批次对最终模型的影响。
max_iter ：EM算法的最大迭代次数。
batch_size: 仅仅在算法使用”online”时有意义，即每次EM算法迭代时使用的文档样本的数量。 evaluate_every

代码官方文档。 https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html

理论篇 https://zhuanlan.zhihu.com/p/31470216

您的支持将鼓励我继续创作！

【NLP】LDA模型与实现

【NLP】LDA模型与实现

Recommend

【TensorFlow1】session,变量

苹果侵入Facebook领地：iOS 15新社交功能令扎克伯格抓狂

【Python】open打开

【统计时序2】平稳性

用了iOS 15之后，这是我最喜欢的几个新功能

构建实时数仓 - 当 TiDB 偶遇 Pravega

【DeepDream】初学

【动态最优化】变分法

苹果藏不住了：iPhone 13最大升级稳了

苹果打开Nearby Interactions大门开发者集成更容易

About Joyk