- What is a language?
- Formal Language Theory
- Motivation
- Examples
- Beyond the membership problem
- Regular Languages
- Properties of Regular Languages
- Finite State Acceptor
- Derivational Morphology
- Weighted FSA
- Finite State Transducer (FST)
- FST for Inflectional Morphology
- Non-Regular Languages
- Center Embedding
- Language models
- Hidden Markov models
- Recurrent neural networks
Formal Language Theory
Formal language theory gives us a mathematical framework for defining languages.
- Studies classes of languages and their computational properties
Language: set of strings
String: sequence of elements from a finite alphabet
- The alphabet can be viewed as a vocabulary
- The elements can be viewed as words
In this course we mainly cover two kinds of formal language:
* Regular languages
* Context-free languages
These are the first two classes in formal language theory. More complex classes, such as context-sensitive languages, come after them, but this course does not cover them in depth.
The main goal is to solve the membership problem: does a given string belong to a given language?
How do we do that? We can define a grammar for the language and then check whether the string conforms to its rules.
- E.g. of language:
  - Binary strings that start with 0 and end with 1:
    - {01, 001, 011, 0001, ...} belong to this language
    - {1, 0, 00, 11, 100, ...} do not belong to this language
  - Even-length sequences from alphabet {a, b}:
    - {aa, ab, ba, bb, aaaa, ...} belong to this language
    - {aaa, aba, bbb, ...} do not belong to this language
  - English sentences that start with a wh-word and end with a question mark:
    - {what?, where my pants?, …} belong to this language
- Membership:
  - Does a string belong to a language? Yes/no
- Beyond the membership problem:
  - Scoring:
    - Graded membership: how acceptable is a string? (language models)
  - Transduction:
    - Translate one string into another (e.g. stemming)
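As a rough illustration of the membership problem, the first two example languages above can be written as yes/no predicates over strings. This is only a sketch: the function names are made up for illustration and do not come from the notes.

```python
def starts0_ends1(string):
    """Membership test: binary strings that start with 0 and end with 1."""
    only_binary = set(string) <= {"0", "1"}
    return only_binary and string.startswith("0") and string.endswith("1")


def even_length_ab(string):
    """Membership test: even-length sequences over the alphabet {a, b}."""
    only_ab = set(string) <= {"a", "b"}
    return only_ab and len(string) % 2 == 0


# Strings listed in the notes as members and non-members.
print([starts0_ends1(s) for s in ["01", "001", "011", "0001"]])
print([starts0_ends1(s) for s in ["1", "0", "00", "11", "100"]])
print([even_length_ab(s) for s in ["aa", "ab", "aaa", "aba"]])
```

Each predicate answers only "in or out"; scoring and transduction need a richer return value than a boolean.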
Regular Languages
The simplest class of languages.
Any regular expression defines a regular language.
- It describes which strings are part of the language.
Formally, a regular expression includes the following operations/definitions:
- A symbol drawn from the alphabet Σ
- The empty string ε
- Concatenation of two regular expressions
- Alternation of two regular expressions
- Kleene star for 0 or more repeats
- Parentheses to define the scope of operations
- Binary strings that start with 0 and end with 1: 0(0|1)*1
- Even-length sequences from alphabet {a, b}: ((a|b)(a|b))*
- English sentences that start with a wh-word and end in ?: (what|where|why|which|whose|whom) Σ* ?
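The three example languages can be tried out with Python's `re` module. This is a sketch under some assumptions: the pattern names and the `in_language` helper are invented here, the list of wh-words is an illustrative guess, and `.` and `\b` are Python conveniences rather than part of the formal definition above (the patterns still describe regular languages).

```python
import re

# Binary strings that start with 0 and end with 1.
BINARY_0_TO_1 = re.compile(r"0[01]*1")
# Even-length sequences over {a, b}: zero or more two-symbol blocks.
EVEN_AB = re.compile(r"(?:[ab][ab])*")
# Sentences starting with a wh-word and ending in "?" (word list is illustrative).
WH_QUESTION = re.compile(r"(?:what|where|when|why|which|who|whose|whom)\b.*\?")


def in_language(pattern, string):
    """Return True if the whole string matches the pattern."""
    return pattern.fullmatch(string) is not None


print(in_language(BINARY_0_TO_1, "0001"))
print(in_language(EVEN_AB, "aba"))
print(in_language(WH_QUESTION, "where my pants?"))
```

`fullmatch` is used rather than `search` because membership is about the entire string, not a substring of it.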
Properties of Regular Languages:
- Closure: if we take regular languages L1 and L2 and combine them, is the resulting language regular? If so, regular languages are said to be closed under that operation.
- Regular languages are closed under these operations:
  - Concatenation and union
  - Intersection: strings that are valid in both L1 and L2
  - Negation: strings that are not in L
- Extremely versatile: we can define regular languages for different properties of language and use them together.
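One way to see closure under intersection and negation concretely is to represent each regular language as a complete deterministic transition table and build the combined table directly. The sketch below is an assumption-laden illustration: the dictionary layout, the state names and the helper functions are all invented here, and the two languages are the binary "starts with 0, ends with 1" language and "even length".

```python
from itertools import product

ALPHABET = ("0", "1")

# L1: binary strings that start with 0 and end with 1 (state names are illustrative).
L1 = {
    "start": "s",
    "finals": {"one"},
    "delta": {
        ("s", "0"): "zero", ("s", "1"): "dead",
        ("zero", "0"): "zero", ("zero", "1"): "one",
        ("one", "0"): "zero", ("one", "1"): "one",
        ("dead", "0"): "dead", ("dead", "1"): "dead",
    },
}

# L2: binary strings of even length.
L2 = {
    "start": "even",
    "finals": {"even"},
    "delta": {
        ("even", "0"): "odd", ("even", "1"): "odd",
        ("odd", "0"): "even", ("odd", "1"): "even",
    },
}


def states(dfa):
    """All states that have outgoing transitions (every state, since tables are complete)."""
    return {state for state, _ in dfa["delta"]}


def accepts(dfa, string):
    """Follow one transition per symbol and check whether we end in a final state."""
    state = dfa["start"]
    for symbol in string:
        state = dfa["delta"][(state, symbol)]
    return state in dfa["finals"]


def intersection(d1, d2):
    """Product construction: run both machines in lockstep on paired states."""
    delta = {
        ((p, q), a): (d1["delta"][(p, a)], d2["delta"][(q, a)])
        for p, q, a in product(states(d1), states(d2), ALPHABET)
    }
    finals = set(product(d1["finals"], d2["finals"]))
    return {"start": (d1["start"], d2["start"]), "finals": finals, "delta": delta}


def negation(dfa):
    """Complement: swap final and non-final states of a complete machine."""
    return {**dfa, "finals": states(dfa) - dfa["finals"]}


both = intersection(L1, L2)
print(accepts(both, "0011"), accepts(both, "001"))
print(accepts(negation(L2), "0"))
```

The result of each construction is again a finite transition table, which is the sense in which the class is closed under these operations.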
Finite State Acceptor
A regular expression defines a regular language, but it does not give an algorithm for checking whether a string belongs to that language.
A finite state acceptor (FSA) describes the computation involved in membership checking.
An FSA consists of:
- Alphabet of input symbols Σ
- Set of states
- Start state
- Final states
- Transition function: symbol and state -> next state
Accepts a string if there is a path from the start state to a final state with transitions matching each symbol
- Dijkstra's shortest-path algorithm, complexity O(V log V + E)
- Input alphabet: {a, b}
- States: {q0, q1}
- Start, final states: q0, {q1}
- Transition function: {(q0, a) -> q0, (q0, b) -> q1, (q1, b) -> q1}
- Regular expression defined by this FSA: a*bb*
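The example FSA above can be simulated directly from its transition function. The sketch below copies the states and transitions from the example; the `accepts` function and the choice to reject on a missing transition are assumptions made for illustration.

```python
START = "q0"
FINALS = {"q1"}
# Transition function from the example: (state, symbol) -> next state.
DELTA = {
    ("q0", "a"): "q0",
    ("q0", "b"): "q1",
    ("q1", "b"): "q1",
}


def accepts(string):
    """Return True if the FSA ends in a final state after reading the whole string."""
    state = START
    for symbol in string:
        if (state, symbol) not in DELTA:
            return False  # no matching transition: reject
        state = DELTA[(state, symbol)]
    return state in FINALS


for candidate in ["b", "aabbb", "", "a", "aba"]:
    print(repr(candidate), accepts(candidate))
```

On these inputs the acceptor agrees with the pattern `a*bb*`: any number of a's followed by at least one b.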
Derivational Morphology
Use of affixes to change a word into another grammatical category
- grace -> graceful -> gracefully
- grace -> disgrace -> disgracefully
- allure -> alluring -> alluringly
- allure -> *allureful
- allure -> *disallure
FSA for Morphology:
- Want to accept valid forms (grace -> graceful)
- Reject invalid ones (allure -> *allureful)
- Generalize to other words
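A toy acceptor over morpheme sequences shows how an FSA can accept the valid forms listed above and reject the starred ones. Everything about the design is an assumption made for illustration: the state names, the choice to feed the machine pre-segmented morphemes, and ignoring spelling changes such as allure + ing -> alluring.

```python
START = "start"
# States reached after a complete, valid word form.
FINALS = {"grace_stem", "allure_stem", "adjective", "adverb"}
# (state, morpheme) -> next state; missing entries mean "not allowed".
TRANSITIONS = {
    ("start", "dis"): "prefixed",
    ("start", "grace"): "grace_stem",
    ("prefixed", "grace"): "grace_stem",
    ("grace_stem", "ful"): "adjective",
    ("start", "allure"): "allure_stem",
    ("allure_stem", "ing"): "adjective",
    ("adjective", "ly"): "adverb",
}


def accepts(morphemes):
    """Return True if the morpheme sequence spells out a valid derived form."""
    state = START
    for morpheme in morphemes:
        if (state, morpheme) not in TRANSITIONS:
            return False
        state = TRANSITIONS[(state, morpheme)]
    return state in FINALS


print(accepts(["grace", "ful", "ly"]))         # gracefully
print(accepts(["dis", "grace", "ful", "ly"]))  # disgracefully
print(accepts(["allure", "ful"]))              # *allureful
print(accepts(["dis", "allure"]))              # *disallure
```

Keeping separate stem states for grace and allure is what blocks *allureful; generalizing to other words would mean assigning each new stem to the class whose suffixes it takes.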
Weighted FSA
Some words are more plausible than others:
- fishful vs. disgracelyful
- musicky vs. writey
Weighted FSA: graded measure of acceptability:
- Start state weight function: λ: Q -> R
- Final state weight function: ρ: Q -> R
- Transition function: δ:(Q, Σ, Q) -> R
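- Total score of a path $\pi = t_1, \dots, t_N$, where each transition is $t_i = (q_{i-1}, x_i, q_i)$: $\lambda(q_0) + \sum_{i=1}^{N} \delta(q_{i-1}, x_i, q_i) + \rho(q_N)$
- Use a shortest-path algorithm to find the path π with minimum cost. Complexity: O(V log V + E)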
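The minimum-cost path for a given string can be sketched with Dijkstra's algorithm over (position, state) pairs. This is an illustration under stated assumptions: weights are treated as additive, non-negative costs; the data layout, the function name and the example weights (including the extra state `q2`) are all hypothetical.

```python
import heapq
import math


def best_path_cost(string, start_w, final_w, arcs):
    """Minimum total cost of any accepting path for `string`, or inf if none exists.

    start_w: state -> start cost (lambda); final_w: state -> final cost (rho);
    arcs: (state, symbol) -> list of (next_state, transition cost) (delta).
    Assumes all costs are non-negative, as Dijkstra's algorithm requires.
    """
    heap = [(cost, 0, state) for state, cost in start_w.items()]
    heapq.heapify(heap)
    settled = set()
    best = math.inf
    while heap:
        cost, pos, state = heapq.heappop(heap)
        if (pos, state) in settled:
            continue
        settled.add((pos, state))
        if pos == len(string):
            # Whole string consumed: add the final-state cost if the state is final.
            if state in final_w:
                best = min(best, cost + final_w[state])
            continue
        for nxt, weight in arcs.get((state, string[pos]), []):
            if (pos + 1, nxt) not in settled:
                heapq.heappush(heap, (cost + weight, pos + 1, nxt))
    return best


# Hypothetical weighted acceptor with two competing paths for "b".
start_w = {"q0": 0.0}
final_w = {"q1": 0.0}
arcs = {
    ("q0", "a"): [("q0", 1.0)],
    ("q0", "b"): [("q1", 2.0), ("q2", 0.5)],
    ("q1", "b"): [("q1", 1.0)],
    ("q2", "b"): [("q1", 0.5)],
}
print(best_path_cost("abb", start_w, final_w, arcs))
print(best_path_cost("ba", start_w, final_w, arcs))
```

For "abb" the cheaper route goes through `q2` (1.0 + 0.5 + 0.5) rather than straight to `q1` (1.0 + 2.0 + 1.0), which is the kind of choice the shortest-path search resolves.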
Finite State Transducer (FST)
Often we do not just want to accept or score strings; we want to translate them into another string.
FSTs add string output capability to FSAs
- Includes an output alphabet
- Transitions now take an input symbol and emit an output symbol (Q, Σ, Σ, Q)
Can be weighted (WFST): graded scores for transitions
E.g. edit distance as a WFST: the distance to transform one string into another
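Edit distance is usually computed with dynamic programming, which amounts to finding the cheapest path through a one-state weighted transducer whose arcs copy a symbol at cost 0 and insert, delete or substitute a symbol at some cost. The sketch below assumes unit costs for all three edit operations (the notes do not specify them), and the function name is illustrative.

```python
def edit_distance(source, target):
    """Minimum number of insertions, deletions and substitutions (unit costs assumed)."""
    n, m = len(source), len(target)
    # dist[i][j] = cost of transforming source[:i] into target[:j].
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i  # delete every source symbol
    for j in range(1, m + 1):
        dist[0][j] = j  # insert every target symbol
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,                 # deletion
                dist[i][j - 1] + 1,                 # insertion
                dist[i - 1][j - 1] + substitution,  # match or substitution
            )
    return dist[n][m]


print(edit_distance("kitten", "sitting"))  # 3
```

Changing the three costs changes the arc weights of the corresponding transducer without changing the algorithm.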
FST for Inflectional Morphology
- Verb inflection in Spanish must match the subject in person and number
- Goal of morphological analysis:
- canto -> cantar + VERB + present + 1P + singular
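A very small deterministic transducer can produce the analysis above for a few present-tense forms of cantar. This is a toy sketch, not a real morphological analyser: the state names, the per-state final output and the restriction to a single verb stem are all assumptions made for illustration.

```python
START = "q0"
# (state, input character) -> (next state, output string emitted on that arc).
TRANSITIONS = {
    ("q0", "c"): ("q1", "c"),
    ("q1", "a"): ("q2", "a"),
    ("q2", "n"): ("q3", "n"),
    ("q3", "t"): ("stem", "t"),
    ("stem", "o"): ("end", "ar+VERB+present+1P+singular"),
    ("stem", "a"): ("after_a", "ar+VERB+present"),
    ("after_a", "s"): ("end", "+2P+singular"),
    ("after_a", "n"): ("end", "+3P+plural"),
}
# Final states, each with extra output appended when the input stops there.
FINAL_OUTPUT = {"end": "", "after_a": "+3P+singular"}


def transduce(word):
    """Return the analysis string for `word`, or None if the FST rejects it."""
    state, output = START, []
    for char in word:
        if (state, char) not in TRANSITIONS:
            return None
        state, emitted = TRANSITIONS[(state, char)]
        output.append(emitted)
    if state not in FINAL_OUTPUT:
        return None
    return "".join(output) + FINAL_OUTPUT[state]


for form in ["canto", "cantas", "canta", "cantan", "cant"]:
    print(form, "->", transduce(form))
```

The final-output table is what lets the machine wait to see whether "canta" is followed by "s" or "n" before committing to person and number.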
Non-Regular Languages
- Arithmetic expressions with balanced parentheses
- (a + (b * (c / d)))
- Can have arbitrarily many opening parentheses
- Need to remember how many open parentheses to produce the same number of closed parentheses
- Cannot be done with a finite number of states
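Checking balanced parentheses needs a counter that can grow without bound, which is exactly what a finite set of states cannot provide. A minimal sketch, with an illustrative function name, that ignores every character other than parentheses:

```python
def balanced(expression):
    """Return True if every '(' is closed by a later ')' and nothing closes early."""
    depth = 0  # number of currently open parentheses; unbounded in general
    for char in expression:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            if depth < 0:
                return False  # a ')' appeared with nothing open
    return depth == 0


print(balanced("(a + (b * (c / d)))"))
print(balanced("(a + (b * c)"))
```

An FSA could only track depth up to some fixed limit, one state per depth value, so sufficiently deep nesting would always defeat it.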
Center Embedding
Center embedding of relative clauses
- The cat loves Mozart
- The cat the dog chased loves Mozart
- The cat the dog the rat bit chased loves Mozart
- The cat the dog the rat the elephant admired bit chased loves Mozart
We need to remember the n subject nouns to ensure that n verbs follow.
This requires a context-free grammar.
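The nested structure can be made explicit with a recursive generator that mirrors a context-free rule of the form NP -> the N | the N NP V. The rule, the function names and the fixed word lists are assumptions for illustration; the words themselves come from the example sentences above.

```python
NOUNS = ["cat", "dog", "rat", "elephant"]
VERBS = ["chased", "bit", "admired"]


def noun_phrase(index, depth):
    """Build 'the N' and, while depth allows, embed a relative clause 'NP V' inside it."""
    phrase = "the " + NOUNS[index]
    if index < depth:
        # The embedded noun phrase comes first; its matching verb closes the clause.
        phrase += " " + noun_phrase(index + 1, depth) + " " + VERBS[index]
    return phrase


def sentence(depth):
    """Sentence with `depth` levels of center embedding (0 to 3 with these word lists)."""
    subject = noun_phrase(0, depth)
    return subject[0].upper() + subject[1:] + " loves Mozart"


for depth in range(4):
    print(sentence(depth))
```

Each recursive call adds one noun on the way in and one verb on the way out, so the nouns and verbs pair up from the outside inwards, which is the dependency a finite-state machine cannot track for arbitrary depth.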