主题建模评估：连贯性分数（Coherence Score）

1.主题连贯性分数

主题连贯性分数（Coherence Score）是一种客观的衡量标准，它基于语言学的分布假设：具有相似含义的词往往出现在相似的上下文中。如果所有或大部分单词都密切相关，则主题被认为是连贯的。

推荐阅读：Full-Text or Abstract ? Examining Topic Coherence Scores Using Latent Dirichlet Allocation

2.计算 LDA 模型的 Coherence Score

2.1 导入包

import pandas as pd
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaMulticore

2.2 数据预处理

# cast tweets to numpy array
docs = df.tweet_text.to_numpy()

# create dictionary of all words in all documents
dictionary = Dictionary(docs)

# filter extreme cases out of dictionary
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)

# create BOW dictionary
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]

2.3 构建模型

LdaMulticore：使用所有 CPU 内核并行化并加速模型训练。

workers：用于并行化的工作进程数。
passes：训练期间通过语料库的次数。

# create LDA model using preferred hyperparameters
lda_model = LdaMulticore(bow_corpus, num_topics=5, id2word=dictionary, passes=4, workers=2, random_state=21)

# Save LDA model to disk
path_to_model = ""
lda_model.save(path_to_model)

# for each topic, print words occuring in that topic
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))

在这里插入图片描述

2.4 计算 Coherence Score

我们可以使用 Gensim 库中的 CoherenceModel 轻松计算主题连贯性分数。对于 LDA，它的实现比较简单：

# import library from gensim  
from gensim.models import CoherenceModel

# instantiate topic coherence model
cm = CoherenceModel(model=lda_model, corpus=bow_corpus, texts=docs, coherence='c_v')

# get topic coherence score
coherence_lda = cm.get_coherence() 
print(coherence_lda)

3.计算 GSDMM 模型的 Coherence Score

GSDMM（Gibbs Sampling Dirichlet Multinomial Mixture）是一种基于狄利克雷多项式混合模型的收缩型吉布斯采样算法（a collapsed Gibbs Sampling algorithm for the Dirichlet Multinomial Mixture model）的简称，它是发表在 $2014$ 年 KDD 上的论文《A Dirichlet Multinomial Mixture Model-based Approach for Short Text Clustering》的数学模型。

GSDMM 主要用于短文本聚类，短文本聚类是将大量的短文本（例如微博、评论等）根据计算某种相似度进行聚集，最终划分到几个类中的过程。GSDMM 主要具备以下优点：

可以自动推断聚类的个数，并且可以快速地收敛；
可以在完备性和一致性之间保持平衡；
可以很好的处理稀疏、高纬度的短文本，可以得到每一类的代表词汇；
较其它的聚类算法，在性能上表现更为突出。

3.1 安装并导入包

pip install git+https://github.com/rwalk/gsdmm.git

import pandas as pd
import numpy as np
from gensim.corpora import Dictionary
from gsdmm import MovieGroupProcess

3.2 数据预处理

# cast tweets to numpy array
docs = df.tweet_text.to_numpy()

# create dictionary of all words in all documents
dictionary = Dictionary(docs)

# filter extreme cases out of dictionary
dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)

# create variable containing length of dictionary/vocab
vocab_length = len(dictionary)

# create BOW dictionary
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]

3.3 构建模型

# initialize GSDMM
gsdmm = MovieGroupProcess(K=15, alpha=0.1, beta=0.3, n_iters=15)

# fit GSDMM model
y = gsdmm.fit(docs, vocab_length)

在这里插入图片描述

# print number of documents per topic
doc_count = np.array(gsdmm.cluster_doc_count)
print('Number of documents per topic :', doc_count)

# Topics sorted by the number of document they are allocated to
top_index = doc_count.argsort()[-15:][::-1]
print('Most important clusters (by number of docs inside):', top_index)

# define function to get top words per topic
def top_words(cluster_word_distribution, top_cluster, values):
    for cluster in top_cluster:
        sort_dicts = sorted(cluster_word_distribution[cluster].items(), key=lambda k: k[1], reverse=True)[:values]
        print("\nCluster %s : %s"%(cluster, sort_dicts))

# get top words in topics
top_words(gsdmm.cluster_word_distribution, top_index, 20)

在这里插入图片描述

3.4 结果可视化

# Import wordcloud library
from wordcloud import WordCloud

# Get topic word distributions from gsdmm model
cluster_word_distribution = gsdmm.cluster_word_distribution

# Select topic you want to output as dictionary (using topic_number)
topic_dict = sorted(cluster_word_distribution[topic_number].items(), key=lambda k: k[1], reverse=True)[:values]

# Generate a word cloud image
wordcloud = WordCloud(background_color='#fcf2ed', 
                      width=1800,
                      height=700,
                      font_path=path_to_font,
                      colormap='flag').generate_from_frequencies(topic_dict)

# Print to screen
fig, ax = plt.subplots(figsize=[20,10])
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off");

# Save to disk
wordcloud_24.to_file(path_to_file)

在这里插入图片描述

3.5 计算 Coherence Score

Gensim 官网对 LDA 之外的模型使用另一种方法计算 连贯性分数。

from gensim.test.utils import common_corpus, common_dictionary
from gensim.models.coherencemodel import CoherenceModel

topics = [
    ['human', 'computer', 'system', 'interface'],
    ['graph', 'minors', 'trees', 'eps']
]

cm = CoherenceModel(topics=topics, corpus=common_corpus, dictionary=common_dictionary, coherence='u_mass')

coherence = cm.get_coherence()  # get coherence value

GSDMM 的实现需要更多的工作，因为我们首先必须将 主题中的单词作为列表（变量主题）获取，然后将其输入 CoherenceModel。

# import library from gensim  
from gensim.models import CoherenceModel

# define function to get words in topics
def get_topics_lists(model, top_clusters, n_words):
    '''
    Gets lists of words in topics as a list of lists.
    
    model: gsdmm instance
    top_clusters:  numpy array containing indices of top_clusters
    n_words: top n number of words to include
    
    '''
    # create empty list to contain topics
    topics = []
    
    # iterate over top n clusters
    for cluster in top_clusters:
        # create sorted dictionary of word distributions
        sorted_dict = sorted(model.cluster_word_distribution[cluster].items(), key=lambda k: k[1], reverse=True)[:n_words]
         
        # create empty list to contain words
        topic = []
        
        # iterate over top n words in topic
        for k,v in sorted_dict:
            # append words to topic list
            topic.append(k)
            
        # append topics to topics list    
        topics.append(topic)
    
    return topics

# get topics to feed to coherence model
topics = get_topics_lists(gsdmm, top_index, 20) 

# evaluate model using Topic Coherence score
cm_gsdmm = CoherenceModel(topics=topics, dictionary=dictionary, corpus=bow_corpus, texts=docs, coherence='c_v')

# get coherence value
coherence_gsdmm = cm_gsdmm.get_coherence()  

print(coherence_gsdmm)