Lecture 20 Topic Modelling

Contents

      • Topic Modelling
      • A Brief History of Topic Models
      • LDA
      • Evaluation
      • Conclusion

Topic Modelling

  • making sense of text

    • English Wikipedia: 6M articles
    • Twitter: 500M tweets per day
    • New York Times: 15M articles
    • arXiv: 1M articles
    • What can we do if we want to learn something about these document collections?
  • questions

    • What are the less popular topics on Wikipedia?
    • What are the big trends on Twitter in the past month?
    • How do social issues evolve over time in the New York Times from the 1900s to the 2000s?
    • What are some influential research areas?
  • topic models to the rescue

    • Topic models learn common, overlapping themes in a document collection
    • Unsupervised model
      • No labels; input is just the documents!
    • What’s the output of a topic model?
      • Topics: each topic associated with a list of words
      • Topic assignments: each document associated with a list of topics
  • what do topics look like

    • A list of words

    • Collectively describes a concept or subject

    • Words of a topic typically appear in the same set of documents in the corpus (i.e. their document occurrences overlap)

    • Wikipedia topics (broad)

    • Twitter topics (short, conversational)

    • New York Times topics

  • applications of topic models

    • Personalised advertising (e.g. based on the types of products bought)
    • Search engines
    • Discovering senses of polysemous words (e.g. apple: the fruit and the company fall into two different clusters)

A Brief History of Topic Models

  • latent semantic analysis (LSA): decompose the word-document matrix $A$ with SVD, $A = U\Sigma V^T$ (a small SVD sketch appears at the end of this section)

    • LSA: truncate the decomposition to the top-$k$ singular values to obtain low-dimensional word and document representations

    • issues

      • Positive and negative values in $U$ and $V^T$
      • Difficult to interpret (negative values)
  • probabilistic LSA

    • based on a probabilistic model, which gets rid of the negative values

    • issues

      • No more negative values!
      • PLSA can learn topics and topic assignments for documents in the training corpus
      • But it is unable to infer topic distributions for new documents
      • PLSA needs to be re-trained for new documents
  • latent dirichlet allocation(LDA)

    • Introduces priors over the document-topic and topic-word distributions
    • Fully generative: trained LDA model can infer topics on unseen documents!
    • LDA is a Bayesian version of PLSA
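
To make the LSA part above concrete, here is a minimal sketch (not from the lecture) of truncated SVD on a tiny word-document count matrix using NumPy; the matrix values and the choice of $k=2$ are invented for illustration. Note the negative entries in the factors, which is exactly the interpretability issue noted above.

```python
import numpy as np

# Toy word-document count matrix: rows = words, columns = documents.
# Values are invented purely for illustration.
A = np.array([
    [2, 0, 1, 0],   # "mouse"
    [1, 0, 2, 0],   # "cat"
    [0, 3, 0, 1],   # "election"
    [0, 2, 0, 2],   # "vote"
], dtype=float)

# Full SVD: A = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(A, full_matrices=False)

# LSA: keep only the top-k singular values (truncation).
k = 2
U_k, S_k, Vt_k = U[:, :k], S[:k], Vt[:k, :]

print("word representations (U_k):\n", np.round(U_k, 2))
print("document representations (Vt_k):\n", np.round(Vt_k, 2))
# Both factors contain negative values, which makes the dimensions
# hard to interpret as "topics" -- the issue that PLSA/LDA address.
```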

LDA

  • LDA

    • Core idea: assume each document contains a mix of topics
    • But the topic structure is hidden (latent)
    • LDA infers the topic structure given the observed words and documents
    • LDA produces soft clusters of documents (based on topic overlap), rather than hard clusters
    • Given a trained LDA model, it can infer topics for new documents (not part of the training data)
  • input

    • A collection of documents
    • Bag-of-words
    • Good preprocessing practice (a small sketch follows this list):
      • Remove stopwords
      • Remove low and high frequency word types
      • Lemmatisation
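
A minimal sketch of the preprocessing steps listed above, assuming a tiny in-memory corpus, an illustrative stop-word list, and made-up frequency cut-offs; a real pipeline would use a proper stop-word list and a lemmatiser.

```python
from collections import Counter

# Toy corpus; documents and thresholds are invented for illustration.
docs = [
    "The farmers grow rice on the farm",
    "Rice farming and agriculture feed the farmers",
    "The election results were announced by the government",
]

STOPWORDS = {"the", "on", "and", "by", "were"}   # illustrative only
MIN_COUNT, MAX_DOC_FRACTION = 1, 0.9             # frequency cut-offs (made up)

def tokenise(doc):
    return [w.lower().strip(".,") for w in doc.split()]

tokenised = [[w for w in tokenise(d) if w not in STOPWORDS] for d in docs]

# Remove very rare and very frequent word types.
counts = Counter(w for doc in tokenised for w in doc)
doc_freq = Counter(w for doc in tokenised for w in set(doc))
vocab = {
    w for w in counts
    if counts[w] >= MIN_COUNT and doc_freq[w] / len(docs) <= MAX_DOC_FRACTION
}
bow = [[w for w in doc if w in vocab] for doc in tokenised]
print(bow)
# Lemmatisation (e.g. "farmers" -> "farmer") would normally follow,
# for instance with an off-the-shelf lemmatiser.
```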
  • output

    • Topics: a distribution over words for each topic

    • Topic assignments: a distribution over topics for each document (the sketch below shows how to read both off)
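
To illustrate the two outputs, here is a small sketch that reads off the top words of each topic from a topic-word probability matrix and the topic mixture of each document from a document-topic matrix; both matrices are invented stand-ins for what a trained model would produce.

```python
import numpy as np

vocab = ["mouse", "cat", "election", "vote"]

# Invented example outputs of a 2-topic model.
topic_word = np.array([          # rows: topics, columns: words; each row sums to 1
    [0.45, 0.45, 0.05, 0.05],    # topic 0: animals
    [0.05, 0.05, 0.50, 0.40],    # topic 1: politics
])
doc_topic = np.array([           # rows: documents, columns: topics
    [0.9, 0.1],
    [0.2, 0.8],
])

for t, dist in enumerate(topic_word):
    top = np.argsort(dist)[::-1][:3]
    print(f"topic {t}:", [vocab[i] for i in top])

for d, dist in enumerate(doc_topic):
    print(f"document {d} topic mixture:", np.round(dist, 2))
```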

  • learning

    • How do we learn the latent topics?

    • Two main families of algorithms:

      • Variational methods
      • Sampling-based methods
    • sampling method (Gibbs sampling; a toy sketch of the full procedure follows these steps)

      1. Randomly assign topics to all tokens in the documents

      2. Collect topic-word and document-topic co-occurrence statistics based on the assignments

        • first add pseudo-counts to every cell of the two matrices (smoothing, so no event has zero probability)

        • then collect the co-occurrence statistics from the topic assignments

      3. Go through every word token in the corpus and sample a new topic for it:

        • remove the current topic assigned to the word

        • subtract its counts from the two matrices

        • compute the probability distribution to sample from: $P(t_i|w,d) \propto P(t_i|w)P(t_i|d)$, where $P(t_i|w)$ comes from the topic-word matrix and $P(t_i|d)$ from the document-topic matrix

          • e.g. $P(t_1|w,d) = P(t_1|\text{mouse}) \times P(t_1|d_1) = \frac{0.01}{0.01+0.01+2.01} \times \frac{1.1}{1.1+1.1+2.1}$
        • sample the new topic randomly from this distribution

      4. Go to step 2 and repeat until convergence

      • when to stop
        • Train until convergence
        • Convergence = the model probability of the training set becomes stable
        • How to compute the model probability?
          • $\log P(w_1,w_2,...,w_m) = \log\sum_{j=1}^{T}P(w_1|t_j)P(t_j|d_{w_1}) + ... + \log\sum_{j=1}^{T}P(w_m|t_j)P(t_j|d_{w_m})$
          • $m$ = number of word tokens
          • $P(w_1|t_j)$: based on the topic-word co-occurrence matrix
          • $P(t_j|d_{w_1})$: based on the document-topic co-occurrence matrix
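
Below is a toy, end-to-end sketch of the sampling procedure above (random initialisation, pseudo-counts, resampling sweeps, and the log-probability convergence check). The corpus, number of topics, priors, and stopping threshold are all invented for illustration, and the sampler uses the simplified conditional $P(t_i|w,d) \propto P(t_i|w)P(t_i|d)$ exactly as written in these notes; a full collapsed Gibbs sampler for LDA uses a slightly different conditional.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus as lists of word ids; vocabulary and documents are invented.
vocab = ["mouse", "cat", "cheese", "election", "vote", "party"]
docs = [
    [0, 1, 2, 0, 2],      # mouse cat cheese mouse cheese
    [3, 4, 5, 3, 4],      # election vote party election vote
    [0, 2, 3, 4, 1],      # a mixed document
]
V, D, T = len(vocab), len(docs), 2      # T = number of topics (chosen arbitrarily)
alpha, beta = 0.1, 0.01                 # priors, in the lecture's ballpark

# Step 1: randomly assign topics to every token.
assignments = [[rng.integers(T) for _ in doc] for doc in docs]

# Step 2: co-occurrence matrices initialised with pseudo-counts (smoothing).
topic_word = np.full((T, V), beta)      # topic-word counts
doc_topic = np.full((D, T), alpha)      # document-topic counts
for d, doc in enumerate(docs):
    for w, t in zip(doc, assignments[d]):
        topic_word[t, w] += 1
        doc_topic[d, t] += 1

def log_probability():
    """Model log-probability of the training set, used as the convergence check."""
    p_w_given_t = topic_word / topic_word.sum(axis=1, keepdims=True)  # P(w|t)
    p_t_given_d = doc_topic / doc_topic.sum(axis=1, keepdims=True)    # P(t|d)
    logp = 0.0
    for d, doc in enumerate(docs):
        for w in doc:
            logp += np.log(p_w_given_t[:, w] @ p_t_given_d[d])
    return logp

# Steps 3-4: sweep over all tokens and resample topics; repeat until stable.
prev = -np.inf
for _ in range(200):
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            t_old = assignments[d][n]
            # Remove the token's current assignment from the counts.
            topic_word[t_old, w] -= 1
            doc_topic[d, t_old] -= 1
            # Simplified sampling distribution: P(t|w,d) ∝ P(t|w) * P(t|d).
            p_t_given_w = topic_word[:, w] / topic_word[:, w].sum()
            p_t_given_d = doc_topic[d] / doc_topic[d].sum()
            p = p_t_given_w * p_t_given_d
            t_new = rng.choice(T, p=p / p.sum())
            # Add the token back with its new topic.
            assignments[d][n] = t_new
            topic_word[t_new, w] += 1
            doc_topic[d, t_new] += 1
    logp = log_probability()
    if abs(logp - prev) < 1e-4:         # convergence: log-probability is stable
        break
    prev = logp

for t in range(T):
    top = np.argsort(topic_word[t])[::-1][:3]
    print(f"topic {t}:", [vocab[i] for i in top])
```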
      • infer topics for new documents (see the sketch after these steps)
        1. Randomly assign topics to all tokens in the new/test documents

        2. Update the document-topic matrix based on the assignments, but use the trained topic-word matrix (kept fixed)

        3. Go through every word in the test documents and sample topics: $P(t_i|w,d) \propto P(t_i|w)P(t_i|d)$

        4. Go to step 2 and repeat until convergence
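
A minimal sketch of inference on a new document under the same simplified scheme: the topic-word matrix is kept fixed (here an invented stand-in for the trained matrix) and only the new document's topic counts are updated.

```python
import numpy as np

rng = np.random.default_rng(1)

vocab = ["mouse", "cat", "cheese", "election", "vote", "party"]
T, alpha = 2, 0.1

# Stand-in for the *trained* topic-word count matrix; kept fixed during inference.
topic_word = np.array([
    [5.01, 4.01, 3.01, 0.01, 0.01, 0.01],   # topic 0: animals
    [0.01, 0.01, 0.01, 6.01, 5.01, 2.01],   # topic 1: politics
])

new_doc = [3, 4, 0, 3]                       # election vote mouse election

# Step 1: random topic assignments; step 2: document-topic counts only.
assignments = [rng.integers(T) for _ in new_doc]
doc_topic = np.full(T, alpha)
for t in assignments:
    doc_topic[t] += 1

# Steps 3-4: resample each token, updating only the document-topic counts.
for _ in range(50):
    for n, w in enumerate(new_doc):
        doc_topic[assignments[n]] -= 1
        p_t_given_w = topic_word[:, w] / topic_word[:, w].sum()
        p_t_given_d = doc_topic / doc_topic.sum()
        p = p_t_given_w * p_t_given_d
        assignments[n] = rng.choice(T, p=p / p.sum())
        doc_topic[assignments[n]] += 1

print("inferred topic distribution:", np.round(doc_topic / doc_topic.sum(), 2))
```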

      • hyper-parameters
        • $T$: the number of topics
        • $\beta$: prior on the topic-word distribution
        • $\alpha$: prior on the document-topic distribution
        • Analogous to $k$ in add-$k$ smoothing for N-gram language models: they act as pseudo-counts that initialise the co-occurrence matrices
        • High prior values → flatter distributions (the Dirichlet sketch after this list illustrates the effect)
          • an extremely large value leads to a uniform distribution
        • Low prior values → peakier distributions
        • $\beta$: generally small (< 0.01)
          • the vocabulary is large, but we want each topic to focus on specific themes
        • $\alpha$: generally larger (> 0.1)
          • because a document typically covers multiple topics
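
Since $\alpha$ and $\beta$ are symmetric Dirichlet priors, their effect on the shape of a distribution can be seen directly by sampling; a small sketch with arbitrarily chosen prior values:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 5   # e.g. five topics

for prior in [100.0, 1.0, 0.1, 0.01]:        # illustrative prior values
    # One draw from a symmetric Dirichlet with concentration `prior`.
    sample = rng.dirichlet([prior] * T)
    print(f"prior={prior:>6}: {np.round(sample, 2)}")

# Expected pattern:
#   large prior -> near-uniform ("flat") distribution over the T topics
#   small prior -> probability mass concentrated on a few topics ("peaky")
```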

Evaluation

  • how to evaluate topic models
    • Unsupervised learning → no labels
    • Intrinsic evaluation:
      • model log-probability / perplexity on test documents (a small sketch follows this list)
      • $\log L = \sum_W\sum_T \log P(w|t)P(t|d_w)$
      • $ppl = \exp\left(\frac{-\log L}{W}\right)$
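
A small sketch of the perplexity calculation from the log-likelihood, with invented per-token probabilities:

```python
import numpy as np

# Suppose the model assigns these per-token probabilities to a tiny test set
# (numbers are invented): p(w) = sum_t P(w|t) P(t|d_w) for each token.
token_probs = np.array([0.02, 0.015, 0.03, 0.01, 0.025])

logL = np.log(token_probs).sum()       # logL = sum_w log sum_t P(w|t)P(t|d_w)
W = len(token_probs)                   # number of test tokens
ppl = np.exp(-logL / W)                # perplexity

print(f"logL = {logL:.2f}, perplexity = {ppl:.1f}")
# Lower perplexity = the model finds the test documents less "surprising".
```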
  • issues with perplexity
    • More topics = better (lower) perplexity
    • Smaller vocabulary = better perplexity
      • Perplexity not comparable for different corpora, or different tokenisation/preprocessing methods
    • Does not correlate with human perception of topic quality
    • Extrinsic evaluation is the way to go:
      • Evaluate topic models based on downstream task
  • topic coherence
    • A better intrinsic evaluation method

    • Measures how coherent the generated topics are

    • A good topic model is one that generates more coherent topics

  • word intrusion
    • Idea: inject one random word into a topic
      • {farmers, farm, food, rice, agriculture} → {farmers, farm, food, rice, cat, agriculture}
    • Ask users to guess which is the intruder word
    • Correct guess → topic is coherent
    • Try to guess the intruder word in:
      • {choice, count, village, i.e., simply, unionist}
    • Manual effort; does not scale
  • PMI ≈ coherence?
    • High PMI for a pair of words → the words are correlated
      • PMI(farm, rice) ↑
      • PMI(choice, village) ↓
    • If all word pairs in a topic have high PMI → the topic is coherent
    • If most topics have high PMI → good topic model
    • Where to get word co-occurrence statistics for PMI?
      • Can use same corpus for topic model
      • A better way is to use an external corpus (e.g. Wikipedia)
  • PMI
    • Compute pairwise PMI of top-N words in a topic
      • $PMI(t)=\sum_{j=2}^{N}\sum_{i=1}^{j-1}\log\frac{P(w_i,w_j)}{P(w_i)P(w_j)}$
    • Given topic: {farmers, farm, food, rice, agriculture}
    • Coherence = sum PMI for all word pairs:
      • PMI(farmers, farm) + PMI(farmers, food) + … + PMI(rice, agriculture)
    • variants
      • Normalised PMI
        • $NPMI(t)=\sum_{j=2}^{N}\sum_{i=1}^{j-1}\frac{\log\frac{P(w_i,w_j)}{P(w_i)P(w_j)}}{-\log P(w_i,w_j)}$
      • conditional probability (shown to be not as good as PMI)
        • $LCP(t)=\sum_{j=2}^{N}\sum_{i=1}^{j-1}\log\frac{P(w_i,w_j)}{P(w_i)}$
    • example: PMI tends to favour rarer words; NPMI alleviates this problem (a small PMI/NPMI sketch follows)
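
A minimal sketch of PMI/NPMI-based topic coherence, estimating word probabilities from document co-occurrence in a tiny invented reference corpus (in practice an external corpus such as Wikipedia would be used):

```python
import math
from itertools import combinations

# Tiny reference corpus (invented); in practice use something large like Wikipedia.
ref_docs = [
    {"farmers", "farm", "rice", "agriculture"},
    {"farm", "food", "rice"},
    {"farmers", "agriculture", "food"},
    {"choice", "village", "count"},
    {"village", "simply", "unionist"},
]
N_docs = len(ref_docs)
EPS = 1e-12   # avoids log(0) for pairs that never co-occur

def p(*words):
    """Probability that all given words occur together in a document."""
    return sum(all(w in d for w in words) for d in ref_docs) / N_docs

def pmi(w1, w2):
    return math.log((p(w1, w2) + EPS) / (p(w1) * p(w2) + EPS))

def npmi(w1, w2):
    return pmi(w1, w2) / -math.log(p(w1, w2) + EPS)

def coherence(topic_words, score=pmi):
    # Sum the pairwise scores over all word pairs in the topic's top-N words.
    return sum(score(w1, w2) for w1, w2 in combinations(topic_words, 2))

good_topic = ["farmers", "farm", "food", "rice", "agriculture"]
bad_topic = ["choice", "count", "village", "simply", "unionist"]
print("coherent topic:  ", round(coherence(good_topic, npmi), 2))
print("incoherent topic:", round(coherence(bad_topic, npmi), 2))
```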

Conclusion

  • Topic model: an unsupervised model for learning latent concepts in a document collection
  • LDA: a popular topic model
    • Learning
    • Hyper-parameters
  • How to evaluate topic models?
    • Topic coherence
