Contents
- Statistical MT
- Neural MT
- Attention Mechanism
- Evaluation
- Conclusion
Machine translation (MT) is the task of translating text from a source language to a target language
- why?
- Removes language barrier
- Makes information in any language accessible to anyone
- But translation is a classic “AI-hard” challenge
- Difficult to preserve the meaning and the fluency of the text after translation
- MT is difficult
- Not just simple word for word translation
- Structural changes, e.g., syntax and semantics
- Multiple word translations, idioms
- Inflections for gender, case etc
- Missing information (e.g., determiners)
Statistical MT
- early MT
- Started in early 1950s
- Motivated by the Cold War to translate Russian to English
- Rule-based system
- Use bilingual dictionary to map Russian words to English words
- Goal: translate 1-2 million words an hour within 5 years
- statistical MT
- Given a French sentence f, the aim is to find the best English sentence e
- $argmax_e P(e|f)$
- Use Bayes' rule to decompose into two components
- $argmax_e \color{blue}{P(f|e)} \color{red}{P(e)}$
- language vs translation model
- $argmax_e \color{blue}{P(f|e)} \color{red}{P(e)}$
- $\color{red}{P(e)}$: language model
- learns how to write fluent English text
- $\color{blue}{P(f|e)}$: translation model
- learns how to translate words and phrases from English to French
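As a concrete (toy) illustration of the noisy-channel decomposition above, here is a minimal Python sketch; the candidate sentences and probabilities are made up for illustration, not learned from data:

```python
import math

# Hand-made toy probabilities, purely illustrative (not learned from data).
# Translation model P(f | e): how likely the French sentence is given the English one.
tm_logprob = {
    ("the cat", "le chat"): math.log(0.7),
    ("a cat", "le chat"): math.log(0.3),
}
# Language model P(e): how fluent the English sentence is.
lm_logprob = {
    "the cat": math.log(0.6),
    "a cat": math.log(0.2),
}

def noisy_channel_decode(f, candidates):
    """Return argmax_e P(f|e) * P(e), computed in log space."""
    return max(candidates, key=lambda e: tm_logprob[(e, f)] + lm_logprob[e])

print(noisy_channel_decode("le chat", ["the cat", "a cat"]))  # -> "the cat"
```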
- how to learn LM and TM
- Language model:
- Text statistics in large monolingual corpora (n-gram models; see the bigram counting sketch below)
- Translation model:
- Word co-occurrences in parallel corpora
- i.e. English-French sentence pairs
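A small sketch of the language-model side: estimating bigram probabilities by counting over a toy monolingual corpus (the corpus and token names below are made up for illustration):

```python
from collections import Counter

# Toy monolingual corpus; in practice this would be millions of sentences.
corpus = [
    "the cat sat on the mat",
    "the cat ate the fish",
]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    unigrams.update(tokens[:-1])                 # context word counts
    bigrams.update(zip(tokens[:-1], tokens[1:])) # (previous word, next word) counts

def p_bigram(word, prev):
    """Maximum-likelihood estimate P(word | prev) = count(prev, word) / count(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

print(p_bigram("cat", "the"))  # 2 counts of "the cat" / 4 counts of "the" = 0.5
```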
- parallel corpora
- One text in multiple languages
- Produced by human translation
- Bible, news articles, legal transcripts, literature, subtitles
- Open parallel corpus: http://opus.nlpl.eu/
- models of translation
- how to learn $P(f|e)$ from parallel text?
- We only have sentence pairs; words are not aligned in the parallel text
- I.e. we don’t have word to word translation
- alignment
- Idea: introduce word alignment as a latent variable into the model
- $P(f,a|e)$
- Use algorithms such as expectation maximisation (EM) to learn it (e.g. GIZA++); a toy EM sketch follows below
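The sketch below is a compact, IBM Model 1 style EM loop over a toy parallel corpus; it is illustrative only (no NULL alignment, tiny vocabulary), whereas tools such as GIZA++ implement the full model family efficiently:

```python
from collections import defaultdict

# Toy parallel corpus: (English, French) sentence pairs, already tokenised.
pairs = [
    (["the", "house"], ["la", "maison"]),
    (["the", "book"], ["le", "livre"]),
    (["a", "book"], ["un", "livre"]),
]

# t[f][e] approximates P(f | e); initialise uniformly over the French vocabulary.
f_vocab = {f for _, fs in pairs for f in fs}
t = defaultdict(lambda: defaultdict(lambda: 1.0 / len(f_vocab)))

for _ in range(10):                                   # EM iterations
    count = defaultdict(lambda: defaultdict(float))   # expected counts c(f, e)
    total = defaultdict(float)                        # expected counts c(e)
    # E-step: softly align each French word to the English words in its sentence
    for es, fs in pairs:
        for f in fs:
            norm = sum(t[f][e] for e in es)
            for e in es:
                delta = t[f][e] / norm
                count[f][e] += delta
                total[e] += delta
    # M-step: re-estimate P(f | e) from the expected counts
    for f in count:
        for e in count[f]:
            t[f][e] = count[f][e] / total[e]

print(round(t["livre"]["book"], 2))  # rises well above the uniform initialisation (0.2)
```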
- complexity
- Some words are dropped and have no alignment
- One-to-many alignment
- Many-to-one alignment
- Many-to-many alignment
- summary
- A very popular field of research in NLP prior to 2010s
- Lots of feature engineering
- State-of-the-art systems are very complex
- Difficult to maintain
- Significant effort needed for new language pairs
Neural MT
- introduction
- Neural machine translation is a newer approach to machine translation
- Use a single neural model to directly translate from source to target
- From a modelling perspective, it is a lot simpler
- From an architecture perspective, it is easier to maintain
- Requires parallel text
- Architecture: encoder-decoder model
- 1st RNN to encode the source sentence
- 2nd RNN to decode the target sentence (a minimal sketch of this architecture follows below)
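A minimal sketch of the encoder-decoder architecture, assuming PyTorch; the vocabulary sizes, dimensions, and class name are arbitrary choices for illustration:

```python
import torch
import torch.nn as nn

class EncoderDecoder(nn.Module):
    """Toy RNN encoder-decoder (no attention); dimensions are arbitrary."""
    def __init__(self, src_vocab=1000, tgt_vocab=1000, dim=64):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, tgt_vocab)  # scores over the target vocabulary

    def forward(self, src_ids, tgt_ids):
        # Encode the source sentence into a single vector (final hidden state).
        _, h = self.encoder(self.src_emb(src_ids))
        # Decode the target sentence, conditioned on that vector.
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), h)
        return self.out(dec_out)  # (batch, tgt_len, tgt_vocab) logits

model = EncoderDecoder()
src = torch.randint(0, 1000, (2, 7))   # batch of 2 source sentences, length 7
tgt = torch.randint(0, 1000, (2, 5))   # corresponding target prefixes, length 5
print(model(src, tgt).shape)           # torch.Size([2, 5, 1000])
```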
- neural MT
- The decoder RNN can be interpreted as a conditional language model
- Language model: predicts the next word given previous words in target sentence y
- Conditional: prediction is also conditioned on the source sentence x
- $P(y|x)=P(y_1|x)P(y_2|y_1,x)\ldots P(y_t|\color{blue}{y_1,\ldots,y_{t-1}},\color{red}{x})$
- training
- Requires parallel corpus just like statistical MT
- Trains with next word prediction, just like a language model
- loss
- During training, we have the target sentence
- We can therefore feed the right word from the target sentence, one step at a time (see the training-step sketch below)
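Continuing the toy PyTorch sketch above (all names are illustrative), one training step feeds the gold target shifted right into the decoder (commonly called teacher forcing) and minimises next-word cross-entropy:

```python
import torch
import torch.nn as nn

# Continuing the toy EncoderDecoder sketch above.
model = EncoderDecoder()
src = torch.randint(0, 1000, (2, 7))   # source sentences
tgt = torch.randint(0, 1000, (2, 6))   # gold target sentences, incl. start/end tokens

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

dec_input = tgt[:, :-1]   # feed the gold previous words at every step
dec_target = tgt[:, 1:]   # predict the next gold word at every step

logits = model(src, dec_input)                          # (batch, len, vocab)
loss = criterion(logits.reshape(-1, logits.size(-1)),   # flatten batch and time
                 dec_target.reshape(-1))
optimizer.zero_grad()
loss.backward()
optimizer.step()
```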
- decoding at test time
- But at test time, we don't have the target sentence (that's what we're trying to predict!)
- argmax: take the word with the highest probability at every step
- exposure bias
- Describes the discrepancy between training and testing
- Training: always have the ground truth tokens at each step
- Test: uses its own prediction at each step
- Outcome: the model is unable to recover from its own errors (error propagation)
- greedy decoding
- argmax decoding is also called greedy decoding
- Issue: does not guarantee the optimal probability $P(y|x)$; a greedy-decoding sketch follows below
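A sketch of greedy decoding with the toy model sketched earlier (the start/end token ids and length cap are made-up values): the model's own prediction is fed back in at every step.

```python
import torch

BOS, EOS, MAX_LEN = 1, 2, 50   # hypothetical special-token ids and a length cap

@torch.no_grad()
def greedy_decode(model, src_ids):
    """Pick the highest-probability word at every step (no lookahead)."""
    ys = torch.tensor([[BOS]])                        # start with the start token
    for _ in range(MAX_LEN):
        logits = model(src_ids, ys)                   # (1, current_len, vocab)
        next_word = logits[0, -1].argmax().item()     # argmax over the last step
        ys = torch.cat([ys, torch.tensor([[next_word]])], dim=1)
        if next_word == EOS:                          # stop at end-of-sentence
            break
    return ys[0].tolist()

print(greedy_decode(model, torch.randint(0, 1000, (1, 7))))
```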
- exhaustive search decoding
- To find the optimal $P(y|x)$, we need to consider every word at every step to compute the probability of all possible sequences
- $O(V^n)$, where V = vocab size and n = sentence length
- Far too expensive to be feasible
- beam search decoding
- Instead of considering all possible words at every step, consider k best words
- That is, we keep track of the top-k words that produce the best partial translations (hypotheses) thus far
- k = beam width (typically 5 to 10)
- k = 1 = greedy decoding
- k = V = exhaustive search decoding
- example: a simplified beam-search sketch is given below
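A simplified beam-search sketch; `score_next` and the toy scorer below are hypothetical stand-ins for the NMT model's next-word log-probabilities. It keeps the k best partial hypotheses at every step and sets aside hypotheses that have terminated:

```python
import math
from heapq import nlargest

EOS, MAX_LEN, K = "</s>", 10, 3   # end token, length cap, beam width k

def beam_search(score_next, k=K):
    """score_next(prefix) -> {word: log-probability of that word coming next}."""
    beams = [(0.0, ["<s>"])]          # k best partial hypotheses (log-prob, words)
    finished = []                     # hypotheses that have generated </s>
    for _ in range(MAX_LEN):
        candidates = []
        for logp, hyp in beams:
            for word, wlogp in score_next(hyp).items():
                candidates.append((logp + wlogp, hyp + [word]))
        # Keep the k best; store those that terminated, keep exploring the rest.
        beams = []
        for logp, hyp in nlargest(k, candidates, key=lambda c: c[0]):
            (finished if hyp[-1] == EOS else beams).append((logp, hyp))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[0])

# Toy next-word scorer, purely illustrative (a real system would query the NMT model).
toy_scorer = lambda prefix: {"cat": math.log(0.5), "sat": math.log(0.3), EOS: math.log(0.2)}
print(beam_search(toy_scorer))
```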
- when to stop
- When decoding, we stop when we generate an end-of-sentence token
- But multiple hypotheses may terminate their sentence at different time steps
- We store hypotheses that have terminated, and continue exploring those that haven't
- Typically we also set a maximum sentence length that can be generated (e.g. 50 words)
- issues of NMT
- Information of the whole source sentence is represented by a single vector
- NMT can generate new details not in source sentence
- NMT tends to generate sentences that are not very fluent (×: in fact NMT output is usually fluent, which is a strength)
- Black-box model; difficult to explain when it doesn’t work
- summary
- Single end-to-end model
- Statistical MT systems have multiple subcomponents
- Less feature engineering
- Can produce new details that are not in the source sentence (hallucination)
Attention Mechanism
- With a long source sentence, the encoded vector is unlikely to capture all the information in the sentence
- This creates an information bottleneck (all the information of a long sentence cannot be captured in a single short vector)
- attention
- For the decoder, at every time step, allow it to 'attend' to words in the source sentence
- encoder-decoder with attention
- variants
- attention
- dot product: $s_t^T h_i$
- bilinear: $s_t^T W h_i$
- additive: $v^T \tanh(W_s s_t + W_h h_i)$
- $c_t$ can be injected into the current state ($s_t$), or into the input word ($y_t$)
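A sketch of the dot-product variant at a single decoder step (assuming PyTorch; shapes are arbitrary): the attention distribution over encoder states gives the context vector $c_t$.

```python
import torch
import torch.nn.functional as F

enc_states = torch.randn(7, 64)    # h_1..h_7: one encoder hidden state per source word
s_t = torch.randn(64)              # current decoder hidden state

scores = enc_states @ s_t          # dot product s_t^T h_i for every source position i
alpha = F.softmax(scores, dim=0)   # attention distribution over the source words
c_t = alpha @ enc_states           # context vector: weighted sum of the encoder states

print(alpha.shape, c_t.shape)      # torch.Size([7]) torch.Size([64])
```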
- summary
- Solves the information bottleneck issue by allowing the decoder to access the source sentence words directly (this also reduces hallucination somewhat: with direct access to the source words, the model is less likely to generate content unrelated to the source sentence)
- Provides some form of interpretability (inspect the attention distribution to see which source word is attended to)
- Attention weights can be seen as word alignments
- Most state-of-the-art NMT systems use attention
- Google Translate (https://slator.com/technology/google-facebook-amazonneural-machine-translation-just-had-its-busiest-month-ever/)
Evaluation
- MT evaluation
- BLEU: compute n-gram overlap between the "reference" translation (ground truth) and the generated translation (a simplified BLEU sketch is given below)
- Typically computed for 1- to 4-grams
- $BLEU = BP \times \exp(\frac{1}{N}\sum_{n=1}^{N} \log p_n)$, where BP is the "Brevity Penalty" that penalises short outputs
- $p_n = \frac{\#\ \text{correct } n\text{-grams}}{\#\ \text{predicted } n\text{-grams}}$
- $BP = \min(1, \frac{\text{output length}}{\text{reference length}})$
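A simplified, single-reference BLEU sketch following the formulas above (real implementations such as sacreBLEU work at corpus level and handle edge cases more carefully):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Counter of all n-grams (as tuples) in a token list."""
    return Counter(zip(*[tokens[i:] for i in range(n)]))

def bleu(predicted, reference, N=4):
    """Simplified single-reference BLEU following the formulas above."""
    pred, ref = predicted.split(), reference.split()
    log_p = 0.0
    for n in range(1, N + 1):
        pred_n, ref_n = ngrams(pred, n), ngrams(ref, n)
        # correct n-grams, clipped so a repeated n-gram is not over-counted
        correct = sum(min(c, ref_n[g]) for g, c in pred_n.items())
        p_n = correct / max(1, sum(pred_n.values()))
        log_p += math.log(p_n) if p_n > 0 else float("-inf")  # any zero p_n gives BLEU 0
    bp = min(1.0, len(pred) / len(ref))   # brevity penalty for short outputs
    return bp * math.exp(log_p / N)

print(bleu("the cat sat on the mat", "the cat sat on the mat"))  # 1.0
print(bleu("the cat", "the cat sat on the mat"))                 # 0.0 (no matching 3- or 4-grams)
```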
Conclusion
- Statistical MT
- Neural MT
- Nowadays use Transformers rather than RNNs
- Encoder-decoder with attention architecture is a general architecture that can be used for other tasks
- Summarisation (lecture 21)
- Other generation tasks such as dialogue generation