Lecture 13 Formal Language Theory Finite State Automata

2025/7/15 19:45:24

- What Is a Language?
  - Formal Language Theory
  - Motivation
  - Examples
  - Questions Beyond Membership
  - Regular Languages
  - Finite State Acceptor
  - Derivational Morphology
  - Weighted FSA
  - Finite State Transducer (FST)
  - FST for Inflectional Morphology
  - Non-Regular Languages
  - Center Embedding

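What Is a Language?

So far, we have seen several methods for processing sequences of symbols such as words, sentences, and documents:

- Language models
- Hidden Markov models
- Recurrent neural networks

However, none of these models captures the nature of language itself, because they can be applied to any sequence of symbols, not only to words and sentences.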

Formal Language Theory

Formal language theory gives us a mathematical framework for defining languages.

It studies classes of languages and their computational properties.


Language: a set of strings
String: a sequence of elements drawn from a finite alphabet
- The alphabet can be thought of as a vocabulary
- The elements can be thought of as words

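Motivation

Formal language theory studies classes of languages and their computational properties. In this course we mainly cover two kinds of formal language:

* Regular languages
* Context-free languages

These are the first two classes in formal language theory. More complex classes follow, such as context-sensitive languages, but we will not go into them in this course.

The main goal is to solve the membership problem: does a given string belong to a given language?

How do we do that? We define a grammar for the language and then check whether the string conforms to the rules of that grammar.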

Examples

Examples of languages:
- Binary strings that start with 0 and end with 1
  - {01, 001, 011, 0001, ...} belong to this language
  - {1, 0, 00, 11, 100, ...} do not belong to this language
- Even-length sequences from the alphabet {a, b}
  - {aa, ab, ba, bb, aaaa, ...} belong to this language
  - {aaa, aba, bbb, ...} do not belong to this language
- English sentences that start with a wh-word and end with a question mark
  - {what?, where my pants?, …} belong to this language
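To make "a language is a set of strings" concrete, the following minimal sketch enumerates every binary string up to a small length and keeps those that belong to the first example language. The helper names `strings_up_to` and `in_language` are illustrative, not part of the lecture.

```python
from itertools import product


def strings_up_to(alphabet, max_len):
    """Yield every string over `alphabet` of length 0..max_len, shortest first."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            yield "".join(chars)


def in_language(s):
    """Membership test: binary strings that start with 0 and end with 1."""
    return len(s) >= 2 and s[0] == "0" and s[-1] == "1"


# The language is infinite; here we only list its members of length <= 3.
members = [s for s in strings_up_to("01", 3) if in_language(s)]
print(members)  # ['01', '001', '011']
```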

Questions Beyond Membership

Membership:
- Does a given string belong to a given language? Yes/no
Beyond the membership problem:
- Scoring
  - Graded membership: how acceptable is a string? (language models)
- Transduction
  - Translate one string into another (e.g. stemming)

Regular Language

Regular Languages

The simplest class of languages.
Any regular expression defines a regular language.
- It describes which strings are part of the language, e.g. 0(0|1)*1
Formally, a regular expression is built from the following operations/definitions:
- A symbol drawn from the alphabet Σ
- The empty string ε
- Concatenation of two regular expressions: RS
- Alternation of two regular expressions: R|S
- Kleene star for zero or more repetitions: R*
- Parentheses () to define the scope of operations
E.g.
- Binary strings that start with 0 and end with 1: 0(0|1)*1
- Even-length sequences from alphabet {a, b}: ((aa)|(ab)|(ba)|(bb))*
- English sentences that start with a wh-word and end in ?: ((what)|(where)|(why)|(which)|(whose)|(whom))Σ*? (the final ? is a literal question mark)
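The three expressions above can be tried out with Python's `re` module. This is a sketch under two assumptions: Σ* is approximated by `.*` (any characters except a newline), and the literal question mark is escaped as `\?` because `?` is an operator in Python's regex syntax. `fullmatch` is used so that the whole string, not just a prefix, has to match.

```python
import re

BINARY = re.compile(r"0(0|1)*1")
EVEN_AB = re.compile(r"((aa)|(ab)|(ba)|(bb))*")
WH_QUESTION = re.compile(r"((what)|(where)|(why)|(which)|(whose)|(whom)).*\?")


def matches(pattern, s):
    """Return True if the entire string s belongs to the pattern's language."""
    return pattern.fullmatch(s) is not None


print(matches(BINARY, "0001"))                  # True
print(matches(BINARY, "100"))                   # False
print(matches(EVEN_AB, "abba"))                 # True
print(matches(EVEN_AB, "aba"))                  # False
print(matches(WH_QUESTION, "where my pants?"))  # True
```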
Properties of Regular Languages:
- Closure: if we take regular languages L1 and L2 and combine them, is the resulting language still regular? If so, regular languages are said to be closed under that operation.
- Regular languages are closed under the following operations:
  - Concatenation and union (follows from the definition)
  - Intersection: strings that are valid in both L1 and L2
  - Negation: strings that are not in L
- Extremely versatile: we can define regular languages for different properties of language and use them together.
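One way to see why intersection preserves regularity is the standard product construction: run two deterministic automata in lockstep and accept only when both accept. The sketch below assumes a dictionary-based automaton representation chosen for illustration, with missing transitions meaning "reject". The even-length language is defined over {0, 1} here (rather than {a, b}) so that both automata share an alphabet.

```python
def run_dfa(dfa, s):
    """Follow transitions symbol by symbol; a missing transition rejects."""
    state = dfa["start"]
    for ch in s:
        state = dfa["delta"].get((state, ch))
        if state is None:
            return False
    return state in dfa["finals"]


def intersect(d1, d2):
    """Product construction: each state of the result is a pair (p, q)."""
    delta = {}
    for (p, a), p_next in d1["delta"].items():
        for (q, b), q_next in d2["delta"].items():
            if a == b:
                delta[((p, q), a)] = (p_next, q_next)
    finals = {(p, q) for p in d1["finals"] for q in d2["finals"]}
    return {"start": (d1["start"], d2["start"]), "finals": finals, "delta": delta}


# Binary strings that start with 0 and end with 1.
STARTS0_ENDS1 = {
    "start": "s",
    "finals": {"o"},
    "delta": {("s", "0"): "z", ("z", "0"): "z", ("z", "1"): "o",
              ("o", "0"): "z", ("o", "1"): "o"},
}
# Binary strings of even length.
EVEN_LENGTH = {
    "start": "e",
    "finals": {"e"},
    "delta": {("e", "0"): "d", ("e", "1"): "d",
              ("d", "0"): "e", ("d", "1"): "e"},
}

BOTH = intersect(STARTS0_ENDS1, EVEN_LENGTH)
print(run_dfa(BOTH, "0011"))  # True
print(run_dfa(BOTH, "001"))   # False
```

The result is again a finite automaton, so the intersection is again a regular language.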

Finite State Acceptor

Finite State Acceptor

A regular expression defines a regular language, but it does not give an algorithm for checking whether a string belongs to that language.
A finite state acceptor (FSA) describes the computation involved in membership checking.
An FSA consists of:
- Alphabet of input symbols Σ
- Set of states Q
- Start state q0 ∈ Q
- Final states F ⊆ Q
- Transition function: symbol and state -> next state
An FSA accepts a string if there is a path from q0 to a final state whose transitions match each symbol of the string in order
- Dijkstra's shortest-path algorithm, complexity O(V log V + E)
E.g.:
- Input alphabet : {a, b}
- States: {q0, q1}
- Start, final states: q0, {q1}
- Transition function: {(q0, a) -> q0, (q0, b)-> q1, (q1, b) -> q1}
- Regular expression defined by this FSA: a*bb*
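The example FSA above is deterministic, so membership can be checked by simply following one transition per input symbol. A minimal sketch, using a dictionary for the transition function (the names `TRANSITIONS`, `START`, `FINALS`, and `accepts` are illustrative):

```python
TRANSITIONS = {
    ("q0", "a"): "q0",
    ("q0", "b"): "q1",
    ("q1", "b"): "q1",
}
START = "q0"
FINALS = {"q1"}


def accepts(s):
    """Return True if the FSA ends in a final state after reading all of s."""
    state = START
    for symbol in s:
        if (state, symbol) not in TRANSITIONS:
            return False  # no matching transition: reject
        state = TRANSITIONS[(state, symbol)]
    return state in FINALS


print(accepts("aab"))   # True
print(accepts("abbb"))  # True
print(accepts("ba"))    # False
print(accepts("a"))     # False
```

The accepted strings are exactly those matched by a*bb*: any number of a's followed by at least one b.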

Derivational Morphology

The use of affixes to change a word into another grammatical category
E.g.:
- grace -> graceful -> gracefully
- grace -> disgrace -> disgracefully
- allure -> alluring -> alluringly
- allure -> *allureful
- allure -> *disallure
FSA for Morphology:
- Want to accept valid forms (grace -> graceful)
- Reject invalid ones (allure -> *allureful)
- Generalize to other words
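A toy version of such an FSA can operate on sequences of morphemes rather than on characters. The state names and the transition table below are hypothetical, written only to cover the grace/allure examples, and spelling changes (allure + ing becoming alluring) are ignored.

```python
# Hypothetical morpheme-level FSA covering only the examples above.
MORPH_TRANSITIONS = {
    ("start", "grace"): "noun_grace",
    ("start", "dis"): "prefix_dis",
    ("prefix_dis", "grace"): "noun_grace",
    ("noun_grace", "ful"): "adjective",
    ("start", "allure"): "verb_allure",
    ("verb_allure", "ing"): "adjective",
    ("adjective", "ly"): "adverb",
}
MORPH_FINALS = {"noun_grace", "verb_allure", "adjective", "adverb"}


def valid_form(morphemes):
    """Accept a list of morphemes if it traces a path to a final state."""
    state = "start"
    for m in morphemes:
        state = MORPH_TRANSITIONS.get((state, m))
        if state is None:
            return False
    return state in MORPH_FINALS


print(valid_form(["dis", "grace", "ful", "ly"]))  # True
print(valid_form(["allure", "ing", "ly"]))        # True
print(valid_form(["allure", "ful"]))              # False
print(valid_form(["dis", "allure"]))              # False
```

Generalizing to other words would mean replacing the individual stems with stem classes, which this sketch does not attempt.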

Weighted FSA

Some words are more plausible than others:
- fishful vs. disgracelyful
- musicky vs. writey
Weighted FSA: graded measure of acceptability:
- Start state weight function: λ: Q -> R
- Final state weight function: ρ: Q -> R
- Transition function: δ:(Q, Σ, Q) -> R
Shortest-Path:
- Total score of a path $\pi = t_1, \ldots, t_N$:
$\lambda(t_0) + \sum_{i=1}^N \delta(t_i) + \rho(t_N)$
- Use a shortest-path algorithm to find the π with minimum cost. Complexity: O(V log V + E)
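The path score and the shortest-path search can be sketched with Dijkstra's algorithm on top of Python's `heapq`. Several assumptions apply: weights are treated as non-negative costs (for example negative log-probabilities), the automaton is stored in plain dictionaries invented for this example, and the search returns the cheapest accepting path over all strings. Note that a binary heap gives O((V + E) log V) rather than the O(V log V + E) bound quoted above, which relies on a Fibonacci heap.

```python
import heapq


def best_path_cost(start_w, final_w, arcs):
    """Cheapest accepting path in a weighted FSA.

    start_w: {state: lambda weight}, final_w: {state: rho weight},
    arcs: {state: [(symbol, next_state, delta weight), ...]}.
    """
    heap = [(w, q) for q, w in start_w.items()]
    heapq.heapify(heap)
    dist = {}  # cheapest known cost to reach each state (start weight included)
    while heap:
        cost, q = heapq.heappop(heap)
        if q in dist:
            continue  # already settled with a cheaper cost
        dist[q] = cost
        for _symbol, nxt, w in arcs.get(q, []):
            if nxt not in dist:
                heapq.heappush(heap, (cost + w, nxt))
    # Add the final-state weight and keep the best reachable final state.
    return min((dist[q] + rho for q, rho in final_w.items() if q in dist),
               default=float("inf"))


# Toy automaton: path "ab" costs 0 + 1 + 2 + 0.5, path "b" costs 0 + 5 + 0.5.
start_w = {"q0": 0.0}
final_w = {"q2": 0.5}
arcs = {
    "q0": [("a", "q1", 1.0), ("b", "q2", 5.0)],
    "q1": [("b", "q2", 2.0)],
}
print(best_path_cost(start_w, final_w, arcs))  # 3.5
```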

Finite State Transducer

Finite State Transducer (FST)

Often we do not want to just accept or score strings, but to translate them into other strings.
An FST adds string output capability to an FSA
- Includes an output alphabet
- Transitions now take an input symbol and emit an output symbol: (Q, Σ, Σ, Q)
Can be weighted (WFST): graded scores for transitions
E.g. edit distance as a WFST: the distance to transform one string into another
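The edit-distance example can be illustrated without building a transducer. The sketch below is the standard Levenshtein dynamic program, not a WFST implementation; it computes the same quantity that a cheapest path would yield in a WFST whose transitions cost 0 for copying a symbol and 1 for each substitution, insertion, or deletion. The unit costs are an assumption.

```python
def edit_distance(source, target):
    """Levenshtein distance with unit costs, computed row by row."""
    prev = list(range(len(target) + 1))  # cost of building target[:j] from ""
    for i, s in enumerate(source, 1):
        curr = [i]  # cost of deleting the first i source symbols
        for j, t in enumerate(target, 1):
            curr.append(min(
                prev[j] + 1,             # delete s
                curr[j - 1] + 1,         # insert t
                prev[j - 1] + (s != t),  # copy (cost 0) or substitute (cost 1)
            ))
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))  # 3
```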

FST for Inflectional Morphology

Verb inflection in Spanish must match the subject in person and number
Goal of morphological analysis:
- canto -> cantar + VERB + present + 1P + singular
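As a rough illustration of the input/output mapping, the sketch below uses a table lookup as a stand-in for what an FST would compute. It is not a transducer, and it covers only the present indicative of regular -ar verbs. The stem lexicon `STEMS` is a hypothetical two-entry example, and ambiguity (such as cantamos also being a past-tense form) is ignored.

```python
# Hypothetical stem lexicon: stem -> infinitive.
STEMS = {"cant": "cantar", "habl": "hablar"}

# Present indicative endings of regular -ar verbs.
ENDINGS = {
    "o": ("1P", "singular"),
    "as": ("2P", "singular"),
    "a": ("3P", "singular"),
    "amos": ("1P", "plural"),
    "áis": ("2P", "plural"),
    "an": ("3P", "plural"),
}


def analyze(word):
    """Map an inflected form to 'lemma + VERB + present + person + number'."""
    for ending in sorted(ENDINGS, key=len, reverse=True):  # longest ending first
        stem = word[: len(word) - len(ending)]
        if word.endswith(ending) and stem in STEMS:
            person, number = ENDINGS[ending]
            return f"{STEMS[stem]} + VERB + present + {person} + {number}"
    return None  # no analysis found


print(analyze("canto"))  # cantar + VERB + present + 1P + singular
```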

Non-Regular Languages

Arithmetic expressions with balanced parentheses
- (a + (b * (c / d)))
- Can have arbitrarily many opening parentheses
- Need to remember how many parentheses are open in order to produce the same number of closing parentheses
- Cannot be done with a finite number of states
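The memory requirement shows up directly in a checker for balanced parentheses: it needs a counter that can grow without bound, which is exactly what a fixed, finite set of states cannot provide. A minimal sketch (the function name `balanced` is illustrative):

```python
def balanced(expr):
    """Return True if every '(' in expr is closed by a later ')'."""
    depth = 0  # number of currently open parentheses; unbounded in general
    for ch in expr:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False  # a ')' with no matching '('
    return depth == 0


print(balanced("(a + (b * (c / d)))"))  # True
print(balanced("(a + (b * c)"))         # False
```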

Center Embedding

Center embedding of relative clauses
- The cat loves Mozart
- The cat the dog chased loves Mozart
- The cat the dog the rat bit chased loves Mozart
- The cat the dog the rat the elephant admired bit chased loves Mozart
Need to remember the n subject nouns to ensure that n verbs follow
Requires a context-free grammar
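The nesting can be mimicked with a recursive function that mirrors a context-free rule of the form NP -> "the" N | "the" N NP V. This rule and the helper names are assumptions made for illustration, not a grammar given in the lecture. The recursion places each embedded verb after the noun phrase it belongs to, so n nouns are always followed by n verbs in reverse order.

```python
def noun_phrase(nouns, rel_verbs):
    """Build a center-embedded noun phrase.

    rel_verbs[i] is the relative-clause verb whose subject is nouns[i + 1].
    """
    head = f"the {nouns[0]}"
    if len(nouns) == 1:
        return head
    inner = noun_phrase(nouns[1:], rel_verbs[1:])
    return f"{head} {inner} {rel_verbs[0]}"


def sentence(nouns, rel_verbs, main_predicate="loves Mozart"):
    """Attach the main predicate and capitalize the first letter."""
    s = f"{noun_phrase(nouns, rel_verbs)} {main_predicate}"
    return s[0].upper() + s[1:]


print(sentence(["cat"], []))
# The cat loves Mozart
print(sentence(["cat", "dog", "rat"], ["chased", "bit"]))
# The cat the dog the rat bit chased loves Mozart
```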