Contents
- What is a language?
- Formal Language Theory
- Motivation
- Examples
- Beyond the membership problem
- Regular Languages
- Finite State Acceptor
- Derivational Morphology
- Weighted FSA
- Finite State Transducer (FST)
- FST for Inflectional Morphology
- Non-Regular Languages
- Center Embedding
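What is a language?
So far we have seen several methods for processing sequences of symbols such as words, sentences, and documents:
- Language models
- Hidden Markov models
- Recurrent neural networks
None of these models captures what is specific to language, however: they can process any sequence of symbols, not only words and sentences.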
Formal Language Theory
Formal language theory provides a mathematical framework for defining languages. It studies classes of languages and their computational properties.
- Language: a set of strings
- String: a sequence of elements drawn from a finite alphabet
  - The alphabet can be thought of as a vocabulary
  - The elements can be thought of as words
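Motivation
Formal language theory studies classes of languages and their computational properties. This course mainly covers two classes of formal languages:
* Regular languages
* Context-free languages
These are the first two classes in formal language theory. More complex classes follow, such as context-sensitive languages, but this course does not cover them in depth.
The main goal is to solve the membership problem: does a given string belong to a given language?
How do we do that? We define a grammar for the language and then check whether the string conforms to the rules of that grammar.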
Examples
- Binary strings that start with 0 and end with 1:
  - {01, 001, 011, 0001, ...} belong to this language
  - {1, 0, 00, 11, 100, ...} do not belong to this language
- Even-length sequences from the alphabet {a, b}:
  - {aa, ab, ba, bb, aaaa, ...} belong to this language
  - {aaa, aba, bbb, ...} do not belong to this language
- English sentences that start with a wh-word and end with a question mark:
  - {what?, where my pants?, ...} belong to this language
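Treating a language as a set of strings means that membership is just a yes/no predicate on strings. The following is a minimal Python sketch of the first two examples; the function names are illustrative and not part of the notes, and the empty string is assumed to count as even-length.

```python
from itertools import product

def in_binary_language(s: str) -> bool:
    """Binary strings that start with 0 and end with 1."""
    return (len(s) >= 2 and set(s) <= {"0", "1"}
            and s[0] == "0" and s[-1] == "1")

def in_even_language(s: str) -> bool:
    """Even-length sequences over the alphabet {a, b} (assumes "" counts)."""
    return set(s) <= {"a", "b"} and len(s) % 2 == 0

def strings_up_to(alphabet: str, max_len: int):
    """Yield every string over `alphabet` of length 0..max_len, shortest first."""
    for n in range(max_len + 1):
        for symbols in product(alphabet, repeat=n):
            yield "".join(symbols)

# Enumerate the members of the first language up to length 3.
members = [s for s in strings_up_to("01", 3) if in_binary_language(s)]
print(members)  # ['01', '001', '011']
```

Enumerating strings like this only works for short lengths; the language itself is infinite, which is why a finite description such as a grammar is needed.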
Beyond the membership problem
- Membership
  - Does a string belong to a language? Yes or no.
- Scoring
  - Graded membership: how acceptable is a string? (language models)
- Transduction
  - Translate one string into another (stemming)
Regular Languages
- The simplest class of languages.
- Any regular expression defines a regular language.
  - It describes which strings are part of the language, e.g. 0(0|1)*1
- Formally, a regular expression is built from the following operations and definitions:
  - A symbol drawn from the alphabet Σ
  - The empty string ε
  - Concatenation of two regular expressions: RS
  - Alternation of two regular expressions: R|S
  - Kleene star for zero or more repetitions: R*
  - Parentheses () to define the scope of operations
- E.g.
  - Binary strings that start with 0 and end with 1: 0(0|1)*1
  - Even-length sequences from the alphabet {a, b}: ((aa)|(ab)|(ba)|(bb))*
  - English sentences that start with a wh-word and end in ?: ((what)|(where)|(why)|(which)|(whose)|(whom))Σ*? (the final ? is a literal question mark)
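The three example expressions can be tried out with Python's `re` module. This is only a sketch: `re.fullmatch` is used so that the whole string must match, `.*` stands in for Σ* (an assumption, since the notes do not fix the alphabet), and the literal question mark is escaped as `\?`.

```python
import re

BINARY = re.compile(r"0(0|1)*1")
EVEN_AB = re.compile(r"((aa)|(ab)|(ba)|(bb))*")
# `.*` approximates Σ*; `\?` is the literal question mark.
WH_QUESTION = re.compile(r"((what)|(where)|(why)|(which)|(whose)|(whom)).*\?")

def is_member(pattern: re.Pattern, s: str) -> bool:
    """Return True if the entire string s matches the pattern."""
    return pattern.fullmatch(s) is not None

print(is_member(BINARY, "0001"))                 # True
print(is_member(BINARY, "100"))                  # False
print(is_member(EVEN_AB, "abba"))                # True
print(is_member(EVEN_AB, "aba"))                 # False
print(is_member(WH_QUESTION, "where my pants?")) # True
```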
- Properties of regular languages:
  - Closure: if we combine regular languages L1 and L2 with some operation, is the resulting language still regular? If it is, regular languages are said to be closed under that operation.
  - Regular languages are closed under the following operations:
    - Concatenation and union: these are themselves regular expression operations
    - Intersection: strings that are valid in both L1 and L2
    - Negation: strings that are not in L
  - Extremely versatile: we can define regular languages for different properties of a language and use them together.
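One standard way to see closure under intersection is the product construction: run two deterministic automata in lockstep and accept only when both accept. The sketch below assumes a simple dictionary encoding of a deterministic automaton (`start`, `finals`, `delta`), which is an illustrative choice and not something the notes prescribe; the two example automata are also made up for the demonstration.

```python
def accepts(dfa, s):
    """Follow the transitions for each symbol; a missing transition rejects."""
    state = dfa["start"]
    for ch in s:
        state = dfa["delta"].get((state, ch))
        if state is None:
            return False
    return state in dfa["finals"]

def intersect(d1, d2):
    """Product construction: states are pairs (state of d1, state of d2)."""
    delta = {}
    for (p, a), p_next in d1["delta"].items():
        for (q, b), q_next in d2["delta"].items():
            if a == b:
                delta[((p, q), a)] = (p_next, q_next)
    finals = {(f1, f2) for f1 in d1["finals"] for f2 in d2["finals"]}
    return {"start": (d1["start"], d2["start"]), "finals": finals, "delta": delta}

# L1: strings over {a, b} that start with a.
STARTS_WITH_A = {
    "start": "s", "finals": {"ok"},
    "delta": {("s", "a"): "ok", ("ok", "a"): "ok", ("ok", "b"): "ok"},
}
# L2: strings over {a, b} of even length.
EVEN_LENGTH = {
    "start": "even", "finals": {"even"},
    "delta": {("even", "a"): "odd", ("even", "b"): "odd",
              ("odd", "a"): "even", ("odd", "b"): "even"},
}
BOTH = intersect(STARTS_WITH_A, EVEN_LENGTH)

print(accepts(BOTH, "ab"))  # True: starts with a, length 2
print(accepts(BOTH, "a"))   # False: odd length
print(accepts(BOTH, "ba"))  # False: does not start with a
```

Because the product automaton still has finitely many states, the intersection is again a regular language.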
Finite State Acceptor
- A regular expression defines a regular language, but it does not give an algorithm for checking whether a string belongs to that language.
- A finite state acceptor (FSA) describes the computation involved in membership checking.
- An FSA consists of:
  - An alphabet of input symbols Σ
  - A set of states Q
  - A start state q0 ∈ Q
  - A set of final states F ⊆ Q
  - A transition function: (symbol, state) -> next state
- It accepts a string if there is a path from q0 to a final state whose transitions match each symbol of the string in order.
  - Dijkstra's shortest-path algorithm, complexity O(V log V + E)
- E.g.:
  - Input alphabet: {a, b}
  - States: {q0, q1}
  - Start, final states: q0, {q1}
  - Transition function: {(q0, a) -> q0, (q0, b) -> q1, (q1, b) -> q1}
  - Regular expression defined by this FSA: a*bb*
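The example FSA can be simulated directly. Since this particular automaton has at most one transition per (state, symbol) pair, a single left-to-right pass is enough and no path search is needed. The sketch below encodes the transition function as a dictionary (an illustrative choice) and cross-checks it against the regular expression a*bb*.

```python
import re

START = "q0"
FINALS = {"q1"}
TRANSITIONS = {
    ("q0", "a"): "q0",
    ("q0", "b"): "q1",
    ("q1", "b"): "q1",
}

def fsa_accepts(s: str) -> bool:
    """Accept s if following the transitions from q0 ends in a final state."""
    state = START
    for symbol in s:
        key = (state, symbol)
        if key not in TRANSITIONS:
            return False  # no matching transition: reject
        state = TRANSITIONS[key]
    return state in FINALS

for s in ["aab", "b", "abb", "", "a", "aba"]:
    # The FSA and the regular expression a*bb* should always agree.
    print(repr(s), fsa_accepts(s), re.fullmatch(r"a*bb*", s) is not None)
```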
Derivational Morphology
- Use of affixes to change a word into another grammatical category
- E.g. (* marks an invalid form):
  - grace -> graceful -> gracefully
  - grace -> disgrace -> disgracefully
  - allure -> alluring -> alluringly
  - allure -> *allureful
  - allure -> *disallure
- FSA for morphology:
  - Want to accept valid forms (grace -> graceful)
  - Reject invalid ones (allure -> *allureful)
  - Generalize to other words
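A toy acceptor for the forms listed above might look like the following. It is a sketch under strong assumptions: the input is already segmented into morphemes, spelling changes such as allure + ing -> alluring are ignored, and the state names and transition table are hypothetical, covering only these two stems.

```python
# Transitions over morphemes rather than characters (hypothetical table).
TRANSITIONS = {
    ("start", "grace"): "grace_stem",
    ("start", "dis"): "dis_prefix",
    ("dis_prefix", "grace"): "grace_stem",
    ("grace_stem", "ful"): "adjective",
    ("start", "allure"): "allure_stem",
    ("allure_stem", "ing"): "adjective",
    ("adjective", "ly"): "adverb",
}
FINAL_STATES = {"grace_stem", "allure_stem", "adjective", "adverb"}

def accepts_morphemes(morphemes) -> bool:
    """Accept a morpheme sequence if it traces a path to a final state."""
    state = "start"
    for morpheme in morphemes:
        state = TRANSITIONS.get((state, morpheme))
        if state is None:
            return False
    return state in FINAL_STATES

print(accepts_morphemes(["grace", "ful", "ly"]))         # True
print(accepts_morphemes(["dis", "grace", "ful", "ly"]))  # True
print(accepts_morphemes(["allure", "ful"]))              # False
print(accepts_morphemes(["dis", "allure"]))              # False
```

Generalizing to other words would mean letting many stems share the same stem states instead of listing each stem by hand.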
Weighted FSA
- Some words are more plausible than others:
  - fishful vs. disgracelyful
  - musicky vs. writey
- Weighted FSA: a graded measure of acceptability:
  - Start state weight function: λ: Q -> R
  - Final state weight function: ρ: Q -> R
  - Transition weight function: δ: (Q, Σ, Q) -> R
- Shortest path:
  - Total score of a path π: the start weight λ of its first state, plus the weight δ of each transition along the path, plus the final weight ρ of its last state
  - Use a shortest-path algorithm to find the π with minimum cost. Complexity: O(V log V + E)
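To make the scoring concrete, here is a small sketch that finds the minimum total cost of any path accepting a given string. It uses a simple dynamic program over states rather than Dijkstra's algorithm, and it assumes that weights are costs that add up along a path. The automaton and all of its weights are invented for illustration.

```python
import math

# λ and ρ: states that are missing are treated as having infinite cost.
START_W = {"q0": 0.0}
FINAL_W = {"q1": 0.0}
# δ: (state, symbol, next_state) -> cost (hypothetical values)
TRANS_W = {
    ("q0", "a", "q0"): 1.0,
    ("q0", "b", "q1"): 0.5,
    ("q1", "b", "q1"): 2.0,
}

def min_cost(s: str) -> float:
    """Minimum total cost over all accepting paths for s (inf if none)."""
    best = dict(START_W)  # cheapest cost of reaching each state so far
    for symbol in s:
        nxt = {}
        for (p, a, q), w in TRANS_W.items():
            if a == symbol and p in best:
                cost = best[p] + w
                if cost < nxt.get(q, math.inf):
                    nxt[q] = cost
        best = nxt
    return min((c + FINAL_W[q] for q, c in best.items() if q in FINAL_W),
               default=math.inf)

print(min_cost("aab"))  # 2.5
print(min_cost("abb"))  # 3.5
print(min_cost("ba"))   # inf
```

A string with no accepting path gets infinite cost, so plain membership is the special case where every weight is either zero or infinity.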
Finite State Transducer (FST)
- Often we do not want just to accept or score strings, but to translate them into another string.
- An FST adds string output capability to an FSA:
  - It includes an output alphabet
  - Transitions now take an input symbol and emit an output symbol: (Q, Σ, Σ, Q)
- Can be weighted (WFST): graded scores for transitions
- E.g. edit distance as a WFST: the distance to transform one string into another
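The minimum-cost path through an edit-distance WFST can be computed with the familiar dynamic program, where each table step corresponds to one weighted transition: copy (cost 0), substitute, insert, or delete. The sketch below assumes unit costs for the three edit operations; the notes do not specify the weights.

```python
def edit_distance(src: str, tgt: str,
                  sub_cost: int = 1, ins_cost: int = 1, del_cost: int = 1) -> int:
    """Minimum total cost of turning src into tgt (unit costs are assumed)."""
    # prev[j] = cost of turning the processed prefix of src into tgt[:j]
    prev = [j * ins_cost for j in range(len(tgt) + 1)]
    for i, s in enumerate(src, start=1):
        curr = [i * del_cost]
        for j, t in enumerate(tgt, start=1):
            curr.append(min(
                prev[j] + del_cost,                        # delete s
                curr[j - 1] + ins_cost,                    # insert t
                prev[j - 1] + (0 if s == t else sub_cost), # copy or substitute
            ))
        prev = curr
    return prev[-1]

print(edit_distance("kitten", "sitting"))  # 3
```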
FST for Inflectional Morphology
- Verb inflection in Spanish must match the subject in person and number
- Goal of morphological analysis:
  - canto -> cantar + VERB + present + 1P + singular
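The input-to-output mapping of such an analyzer can be illustrated with a suffix table. This is a deliberate simplification, not a real FST: it handles only some present-tense endings of regular -ar verbs, has no lexicon (so any word ending in one of these suffixes is analyzed as a verb), and the names `ENDINGS` and `analyze` are hypothetical.

```python
# Present-tense endings of regular -ar verbs (a subset, for illustration).
ENDINGS = {
    "o": "1P + singular",
    "as": "2P + singular",
    "a": "3P + singular",
    "amos": "1P + plural",
    "an": "3P + plural",
}

def analyze(word: str):
    """Map an inflected form to lemma + features, or None if no ending fits."""
    # Try longer endings first so that "amos" wins over shorter suffixes.
    for ending in sorted(ENDINGS, key=len, reverse=True):
        if word.endswith(ending) and len(word) > len(ending):
            stem = word[: -len(ending)]
            return f"{stem}ar + VERB + present + {ENDINGS[ending]}"
    return None

print(analyze("canto"))     # cantar + VERB + present + 1P + singular
print(analyze("cantamos"))  # cantar + VERB + present + 1P + plural
```

A real FST would read the word one character at a time and emit the analysis on its output tape, and it would be combined with a lexicon to reject non-verbs.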
Non-Regular Languages
- Arithmetic expressions with balanced parentheses
  - (a + (b * (c / d)))
  - Can have arbitrarily many opening parentheses
  - Need to remember how many parentheses are open in order to produce the same number of closing parentheses
  - Cannot be done with a finite number of states
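The need for unbounded memory shows up directly in code: checking balance requires a counter that can grow without limit, which a fixed, finite set of states cannot represent. The sketch below checks only the parentheses and ignores the rest of the arithmetic syntax.

```python
def balanced(expr: str) -> bool:
    """Return True if every '(' has a matching ')' in the right order."""
    depth = 0  # number of currently open parentheses (unbounded)
    for ch in expr:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False  # a ')' with nothing open
    return depth == 0

print(balanced("(a + (b * (c / d)))"))  # True
print(balanced("(a + (b * c)"))         # False
```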
Center Embedding
- Center embedding of relative clauses
  - The cat loves Mozart
  - The cat the dog chased loves Mozart
  - The cat the dog the rat bit chased loves Mozart
  - The cat the dog the rat the elephant admired bit chased loves Mozart
- Need to remember the n subject nouns to ensure that n verbs follow
- Requires a context-free grammar
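The noun/verb pattern in these sentences is n nouns followed by n verbs, which the context-free rule S -> N S V | N V generates. The following is a minimal recursive recognizer for that pattern. The tiny lexicon is hypothetical, and words outside it (such as "the" and the object "Mozart") are simply skipped.

```python
NOUNS = {"cat", "dog", "rat", "elephant"}     # hypothetical lexicon
VERBS = {"loves", "chased", "bit", "admired"}

def tag(sentence: str):
    """Map each known word to N or V; other words are ignored."""
    tags = []
    for word in sentence.lower().split():
        if word in NOUNS:
            tags.append("N")
        elif word in VERBS:
            tags.append("V")
    return tags

def matches(tags) -> bool:
    """Recognize N^n V^n (n >= 1) using the rule S -> N S V | N V."""
    if len(tags) < 2 or tags[0] != "N" or tags[-1] != "V":
        return False
    inner = tags[1:-1]
    return len(inner) == 0 or matches(inner)

print(matches(tag("The cat the dog chased loves Mozart")))  # True
print(matches(tag("The cat the dog loves Mozart")))         # False
```

The recursion depth grows with n, mirroring the fact that the number of nouns to remember is unbounded.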