Lecture 9 Lexical Semantics

- Introduction: sentiment analysis
- Word Semantics
- Word meanings
- WordNet
- Synsets
- Noun Relations in WordNet
- Hypernymy Chain
- Word Similarity
- Word Similarity with Paths
- Beyond Path Length
- Abstract Nodes
- Concept Probability of A Node
- Similarity with Information Content
- Word Sense Disambiguation
- Supervised WSD
- Unsupervised WSD: Lesk
- Unsupervised WSD: Clustering

Introduction: sentiment analysis

Why should we care about lexical semantics in NLP? Consider a sentiment analysis example: suppose we need to predict the sentiment polarity of a given piece of text.

Bag-of-words, KNN classifier. Training data:
- This is a good movie -> positive
- This is a great movie -> positive
- This is a terrible film -> negative
- This is a wonderful film -> ?
Two problems here:
- The model does not know that movie and film are synonyms. Since film appears only in negative examples, the model learns that it is a negative word.
- wonderful is not in the vocabulary (OOV: Out-Of-Vocabulary).
Comparing words directly will not work. How do we make sure we compare word meanings?
Solution: Add this information explicitly through a lexical database.
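The failure described above can be reproduced with a minimal sketch. It assumes k = 1 and cosine similarity over raw word counts; neither choice is specified in these notes, and the helper names (`bow`, `cosine`, `predict_1nn`) are made up for illustration.

```python
import math
from collections import Counter

# Training sentences and labels copied from the example above (lowercased).
TRAIN = [
    ("this is a good movie", "positive"),
    ("this is a great movie", "positive"),
    ("this is a terrible film", "negative"),
]

# The vocabulary comes from the training data only, so unseen words are OOV.
VOCAB = sorted({w for text, _ in TRAIN for w in text.split()})


def bow(text):
    """Map a sentence to a count vector over VOCAB; OOV words are dropped."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in VOCAB]


def cosine(u, v):
    """Cosine similarity between two count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def predict_1nn(text):
    """Return the label of the most similar training sentence (k = 1, an assumption)."""
    query = bow(text)
    _, label = max(TRAIN, key=lambda ex: cosine(query, bow(ex[0])))
    return label


query_text = "this is a wonderful film"
oov = [w for w in query_text.split() if w not in VOCAB]
print(oov)                      # ['wonderful']
print(predict_1nn(query_text))  # negative
```

The only content word the query shares with the training data is film, which appears only in the negative sentence, so the nearest neighbour is the negative example even though the sentence is clearly positive.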

Lexical Database

Word Semantics

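Lexical Semantics:
- How the meanings of words connect to one another
- Manually constructed resources: lexicons, thesauri, ontologies, etc.
  
  We can describe the meaning of a word in text, and we can also look at how different words relate to one another. For example, film and movie are synonyms, so if we do not know what film means but we know what movie means and that the two are synonyms, we can work out the meaning of film. We will see how such lexical databases are built by hand: thesauri and ontologies capture the connections between word meanings.
Distributional Semantics:
- How words relate to each other in the text
- Automatically created resources from corpora
  
  We can also do the same thing in a different way. The task is still to capture word meaning, but instead of hiring linguists to build a lexical database by hand, we can try to learn word meanings directly from corpora. We use machine learning or corpus statistics to observe how words relate to each other, rather than getting that information directly from language experts.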

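Word meanings

Referenced object in the physical or social world:
- But usually not useful in text analysis
  
  Think back to learning a new word as a child. For humans, the meaning of a word includes a reference to the physical world. For example, when you learn the word dog, you ask yourself what a dog is. You do not learn the word from text or spoken descriptions alone, but by observing real dogs. The information involved goes beyond language and includes how a dog sounds, how it smells, and so on; together, all of this makes up the meaning of the word dog. This extra information is usually of little use in text analysis, though, and it is not easy to represent.
Dictionary definition:
- Dictionary definitions are necessarily circular
- Only useful if meaning is already understood
  
  So we look for another way to learn word meanings: looking them up in a dictionary. But dictionary definitions are usually circular, because we explain the target word using other words.
- E.g.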
  
  red: n. the color of blood or a ruby
  
  blood: n. the red liquid that circulates in the heart, arteries, and veins of animals
  
  Here the word red is described by blood and blood is described by red. Therefore, to understand either definition, the meaning of the other word must already be understood.
  
  If we know neither word, we cannot recover its meaning from the definitions. Dictionary definitions are still very useful, though: when we look up a new word we usually already have some vocabulary to build on. For example, when learning a new language, a dictionary provides a lot of useful information.
Their relationships with other words:
- Also circular, but better for text analysis
  
  Another way to learn the meaning of a word is to look at its relations to other words. This is also circular, but it is very useful when we need to use a word in context, as in the earlier film and movie example. Relations between words are therefore another good way to represent word meaning.
Word sense: A word sense describes one aspect of the meaning of a word
- E.g. mouse: a quiet animal like a mouse
Polysemous: If a word has multiple senses, it is polysemous.
- E.g.
  - mouse¹: a mouse controlling a computer system in 1968
  - mouse²: a quiet animal like a mouse
Gloss: Textual definition of a sense, given by a dictionary
Meaning Through Relations:
- Synonymy: near identical meaning
  - vomit - throw up
  - big - large
- Antonymy: opposite meaning
  - long - short
  - big - little
- Hypernymy: is-a relation
  - The first word is the hyponym, a more specific instance of the second, e.g. cat.
  - The second word is the hypernym, a broader category than the first, e.g. animal.
  - cat - animal
  - mango - fruit
- Meronymy: part-whole relation
  - The first word is the meronym, a part of the second, e.g. leg.
  - The second word is the holonym, the whole that contains the first, e.g. chair.
  - leg - chair
  - wheel - car

WordNet

A database of lexical relations
English WordNet includes ~120,000 nouns, ~12,000 verbs, ~21,000 adjectives, ~4,000 adverbs
On average: a noun has 1.23 senses, a verb 2.16
E.g. the noun bass: its senses fall into two broad groups, music and fish (the sea bass). WordNet splits them further into 8 senses. That granularity may be too fine for typical NLP tasks, so we usually cluster the senses before using them.

Synsets

Nodes of WordNet are not words or lemmas, but senses
They are represented by sets of synonyms, called synsets
E.g. Bass:
- {bass, deep}
- {bass, bass voice, basso}
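The idea that nodes are senses rather than words can be sketched with a toy data structure. The sense identifiers below are hypothetical labels, not WordNet's own, and only the two synsets listed above are included.

```python
# Hypothetical sense identifiers; the member lemmas are the two synsets listed above.
SYNSETS = {
    "bass_sense_a": {"bass", "deep"},
    "bass_sense_b": {"bass", "bass voice", "basso"},
}

# Invert the mapping: each lemma points to every sense (node) it belongs to.
LEMMA_TO_SENSES = {}
for sense_id, lemmas in SYNSETS.items():
    for lemma in lemmas:
        LEMMA_TO_SENSES.setdefault(lemma, set()).add(sense_id)


def senses(word):
    """Return the sense nodes a word belongs to (empty if the word is unknown)."""
    return sorted(LEMMA_TO_SENSES.get(word, set()))


def synonyms(word):
    """Return every other lemma that shares at least one synset with the word."""
    result = set()
    for sense_id in senses(word):
        result |= SYNSETS[sense_id]
    result.discard(word)
    return sorted(result)


print(senses("bass"))    # ['bass_sense_a', 'bass_sense_b']
print(senses("deep"))    # ['bass_sense_a']
print(synonyms("bass"))  # ['bass voice', 'basso', 'deep']
```

The word bass maps to two nodes, while deep maps to one: relations in the database hold between the nodes, not between the surface words.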

Noun Relations in WordNet

[Figure: noun relations in WordNet]

Hypernymy Chain

[Figure: hypernymy chain]

Word Similarity

Synonymy: film - movie
What about show - film and opera - film?
Unlike synonymy, which is a binary relation, word similarity is a spectrum
Use a lexical database or thesaurus to estimate word similarity

Word Similarity with Paths

Given WordNet, find similarity based on path length
pathlen(c₁, c₂) = 1 + number of edges in the shortest path between senses c₁ and c₂
Similarity between two senses:

$simpath(c_1, c_2) = \frac{1}{pathlen(c_1, c_2)}$
Similarity between two words:

$wordsim(w_1, w_2) = \max_{c_1 \in senses(w_1),\, c_2 \in senses(w_2)} simpath(c_1, c_2)$
E.g.
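A minimal sketch of path-based similarity on a toy hierarchy. The example tree is not reproduced in these notes, so the `PARENT` table below is an assumption, chosen so that the path lengths agree with the nickel/money and nickel/Richter scale values quoted in these notes. Each word is given a single sense for simplicity.

```python
# Assumed toy hierarchy, stored as child -> parent ("standard" is the root).
PARENT = {
    "medium of exchange": "standard",
    "scale": "standard",
    "Richter scale": "scale",
    "currency": "medium of exchange",
    "money": "medium of exchange",
    "coinage": "currency",
    "coin": "coinage",
    "nickel": "coin",
    "dime": "coin",
    "fund": "money",
    "budget": "fund",
}


def ancestors(node):
    """Return [node, parent, grandparent, ..., root]."""
    chain = [node]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def pathlen(c1, c2):
    """1 + number of edges on the shortest path between c1 and c2 in the tree."""
    steps_up = {n: i for i, n in enumerate(ancestors(c1))}
    for j, n in enumerate(ancestors(c2)):
        if n in steps_up:  # first shared ancestor: the path turns around here
            return 1 + steps_up[n] + j
    raise ValueError("no common ancestor")


def simpath(c1, c2):
    return 1.0 / pathlen(c1, c2)


def wordsim(w1, w2, senses):
    """Maximum simpath over all sense pairs of the two words."""
    return max(simpath(c1, c2) for c1 in senses[w1] for c2 in senses[w2])


# One sense per word here; a real lexicon would list several.
SENSES = {"nickel": ["nickel"], "money": ["money"], "Richter scale": ["Richter scale"]}

print(pathlen("nickel", "money"), round(simpath("nickel", "money"), 3))  # 6 0.167
print(pathlen("nickel", "Richter scale"), simpath("nickel", "Richter scale"))  # 8 0.125
```

Rounded to two decimals by hand, these are the 0.17 and 0.13 figures used in the discussion of path length.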

Beyond Path Length

Problem of simple path length: Edges vary widely in actual semantic distance
- E.g. from the previous example tree:
  - simpath(nickel, money) = 0.17
  - simpath(nickel, Richter scale) = 0.13
  - By simple path length, the similarities of nickel-money and nickel-Richter scale are very close. In actual meaning, however, nickel is much more similar to money than to Richter scale
Solution 1: include depth information
- Use path to find lowest common subsumer (LCS)
- Compare using depths:
  
  $simwup(c_1, c_2) = \frac{2 * depth(LCS(c_1, c_2))}{depth(c_1) + depth(c_2)}$
  
  simwup is high when the parent (the LCS) is deep or the senses are shallow
- E.g.
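A minimal sketch of the depth-based measure on a toy hierarchy. The `PARENT` table is an assumption (the example tree is not reproduced in these notes), and so is the convention that the root has depth 1; both were chosen so that the results agree with the simwup values quoted in these notes.

```python
# Assumed toy hierarchy, stored as child -> parent ("standard" is the root).
PARENT = {
    "medium of exchange": "standard",
    "scale": "standard",
    "Richter scale": "scale",
    "currency": "medium of exchange",
    "money": "medium of exchange",
    "coinage": "currency",
    "coin": "coinage",
    "nickel": "coin",
    "dime": "coin",
    "fund": "money",
    "budget": "fund",
}


def ancestors(node):
    """Return [node, parent, grandparent, ..., root]."""
    chain = [node]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def depth(node):
    """Depth counted in nodes from the root, so the root has depth 1 (an assumption)."""
    return len(ancestors(node))


def lcs(c1, c2):
    """Lowest common subsumer: the deepest node that is an ancestor of both."""
    seen = set(ancestors(c1))
    for n in ancestors(c2):  # walks upward, so the first hit is the lowest
        if n in seen:
            return n
    raise ValueError("no common ancestor")


def simwup(c1, c2):
    return 2 * depth(lcs(c1, c2)) / (depth(c1) + depth(c2))


print(lcs("nickel", "money"), round(simwup("nickel", "money"), 2))  # medium of exchange 0.44
print(lcs("nickel", "Richter scale"), round(simwup("nickel", "Richter scale"), 2))  # standard 0.22
```

Under these assumptions nickel-money scores twice as high as nickel-Richter scale, whereas the path-based scores for the same pairs were nearly equal.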

Abstract Nodes

Node depth is still a poor semantic distance metric. E.g.:
- simwup(nickel, money) = 0.44
- simwup(nickel, Richter scale) = 0.22
Nodes high in the hierarchy are very abstract or general

Concept Probability of A Node

Intuition:
- general node -> high concept probability
- narrow node -> low concept probability
Find all the children of the node, and sum up their unigram probabilities: $P(c) = \frac{\sum_{s \in child(c)}count(s)}{N}$
- child(c): synsets that are children of c
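A minimal sketch of the concept probability. The hierarchy and the counts are made up for illustration. The sketch also assumes that the sum runs over c itself and every synset below it, and that N is the total of all counts; these notes do not spell out either detail.

```python
# Hypothetical hierarchy (parent -> children) and made-up corpus counts.
CHILDREN = {
    "entity": ["geological-formation", "artifact"],
    "geological-formation": ["natural-elevation", "shore"],
    "natural-elevation": ["hill", "ridge"],
    "shore": ["coast"],
}
COUNT = {
    "entity": 0,
    "geological-formation": 2,
    "artifact": 50,
    "natural-elevation": 1,
    "shore": 5,
    "hill": 20,
    "ridge": 4,
    "coast": 18,
}
N = sum(COUNT.values())  # assumed normaliser: total count over all synsets


def subsumed(c):
    """Return c together with every synset below it in the hierarchy."""
    nodes = [c]
    for child in CHILDREN.get(c, []):
        nodes.extend(subsumed(child))
    return nodes


def concept_prob(c):
    """P(c): share of the corpus counts that falls at or below node c."""
    return sum(COUNT[s] for s in subsumed(c)) / N


print(concept_prob("entity"))                # 1.0
print(concept_prob("geological-formation"))  # 0.5
print(concept_prob("hill"))                  # 0.2
```

With these toy counts the probability shrinks on the way down the hierarchy, which matches the intuition that general nodes have high concept probability and narrow nodes low.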

Similarity with Information Content

Information Content: $IC = -logP(c)$
- general concept = small values
- narrow concept = large values
simlin: $simlin(c_1, c_2) = \frac{2*IC(LCS(c_1, c_2))}{IC(c_1) + IC(c_2)}$
- simlin is high when the parent concept is narrow or the sense concepts are general
E.g.

$simlin(hill, coast) = \frac{-2\log P(\text{geological-formation})}{-\log P(\text{hill}) - \log P(\text{coast})} = \frac{-2\log 0.00176}{-\log 0.0000189 - \log 0.0000216} \approx 0.59$
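The arithmetic of this example can be checked with a short sketch. The probabilities are the ones quoted above; the lowest common subsumer is passed in by hand because no hierarchy is available here, and the natural logarithm is an arbitrary choice since the base cancels in the ratio.

```python
import math

# Concept probabilities quoted in the worked example above.
P = {
    "geological-formation": 0.00176,
    "hill": 0.0000189,
    "coast": 0.0000216,
}


def ic(c):
    """Information content: IC(c) = -log P(c)."""
    return -math.log(P[c])


def simlin(c1, c2, lcs):
    """Lin similarity, with the lowest common subsumer supplied by the caller."""
    return 2 * ic(lcs) / (ic(c1) + ic(c2))


value = simlin("hill", "coast", "geological-formation")
print(round(value, 2))  # 0.59
```

The general concept has the smaller information content (about 6.3 against about 10.8 for the two narrow concepts), which is why the score lands well below 1.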

Word Sense Disambiguation

Task: Select the correct sense for words in a sentence
Baseline: Assume the most popular sense
Good WSD is potentially useful for many tasks:
- Knowing which sense of mouse is used in a sentence is important
- Less popular nowadays because sense information is implicitly captured by contextual representations

Supervised WSD

Apply standard machine learning classifiers
Feature vectors are typically words and syntax around the target word
Requires sense-tagged corpora
- E.g. SENSEVAL, SEMCOR
- Very time-consuming to create

Unsupervised WSD: Lesk

Lesk: Choose the sense whose WordNet gloss overlaps most with the context
E.g.
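A minimal sketch of this overlap idea, in a simplified form that only counts shared content words. The two glosses are made up for illustration and are not actual WordNet glosses; the stopword list and sense labels are likewise hypothetical.

```python
# Small hand-picked stopword list (an assumption; any function-word list would do).
STOPWORDS = {"a", "an", "the", "of", "in", "on", "or", "by", "and", "he", "she", "her", "while"}

# Made-up glosses for two senses of "bass"; NOT actual WordNet glosses.
GLOSSES = {
    "bass_music": "the lowest part of the musical range played by an instrument or sung by a voice",
    "bass_fish": "an edible freshwater or sea fish caught by anglers in a lake or river",
}


def content_words(text):
    """Lowercase, strip punctuation, and drop stopwords."""
    words = {w.strip(".,;:!?").lower() for w in text.split()}
    return words - STOPWORDS - {""}


def lesk(context, glosses):
    """Return the sense whose gloss shares the most content words with the context."""
    ctx = content_words(context)
    overlaps = {sense: len(ctx & content_words(gloss)) for sense, gloss in glosses.items()}
    return max(overlaps, key=overlaps.get), overlaps


sense_1, overlap_1 = lesk("He caught a huge bass while fishing in the lake", GLOSSES)
sense_2, overlap_2 = lesk("She played the bass part on her instrument", GLOSSES)
print(sense_1, overlap_1)  # bass_fish {'bass_music': 0, 'bass_fish': 2}
print(sense_2, overlap_2)  # bass_music {'bass_music': 3, 'bass_fish': 0}
```

No sense-tagged training data is needed, only the glosses, which is what makes the method unsupervised. On a tie, this sketch simply returns the first sense listed.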

Unsupervised WSD: Clustering

Gather usages of the word
Perform clustering on context words to learn the different senses
- Rationale: context words of the same sense should be similar
Disadvantages:
- Sense clusters are not very interpretable
- Need to align with dictionary senses
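A minimal sketch of the clustering idea. These notes do not name a clustering algorithm, so a simple greedy scheme stands in for one: each usage joins the most similar existing cluster if the Jaccard overlap of context words reaches a threshold, and otherwise starts a new cluster. The usages, stopword list, and threshold are all made up for illustration.

```python
STOPWORDS = {"a", "an", "the", "in", "for", "he", "she", "while"}

# Made-up usages of the target word "bass".
USAGES = [
    "he caught a bass in the lake while fishing",
    "the band needs a bass guitar player",
    "fishing for bass in the river",
    "she plays bass guitar in a band",
]


def context_words(sentence, target="bass"):
    """Content words around the target (the target itself is excluded)."""
    return set(sentence.lower().split()) - STOPWORDS - {target}


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(usages, threshold=0.1):
    """Greedy clustering: join the best-matching cluster or open a new one."""
    clusters = []  # each cluster keeps its pooled context words and its members
    for usage in usages:
        words = context_words(usage)
        best = max(clusters, key=lambda c: jaccard(words, c["words"]), default=None)
        if best is not None and jaccard(words, best["words"]) >= threshold:
            best["members"].append(usage)
            best["words"] |= words
        else:
            clusters.append({"words": set(words), "members": [usage]})
    return [c["members"] for c in clusters]


result = cluster(USAGES)
for i, members in enumerate(result):
    print(i, members)
```

The fishing usages and the band usages end up in separate clusters, but the clusters are only numbered: nothing says which one is the fish sense, which is the alignment problem noted above.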