Lecture 17: Machine Translation


Contents

      • Statistical MT
      • Neural MT
      • Attention Mechanism
      • Evaluation
      • Conclusion

Machine translation (MT) is the task of translating text from a source language to a target language.

  • why?
    • Removes language barrier
    • Makes information in any language accessible to anyone
    • But translation is a classic “AI-hard” challenge
    • Difficult to preserve the meaning and the fluency of the text after translation
  • MT is difficult
    • Not just simple word-for-word translation
    • Structural changes, e.g., syntax and semantics
    • Multiple word translations, idioms
    • Inflections for gender, case, etc.
    • Missing information (e.g., determiners)

Statistical MT

  • early MT
    • Started in early 1950s
    • Motivated by the Cold War to translate Russian to English
    • Rule-based system
      • Use bilingual dictionary to map Russian words to English words
    • Goal: translate 1-2 million words an hour within 5 years
  • statistical MT
    • Given a French sentence f, the aim is to find the best English sentence e

      • $\mathrm{argmax}_e P(e|f)$
    • Use Bayes’ rule to decompose it into two components

      • $\mathrm{argmax}_e P(f|e)\,P(e)$
    • language vs translation model

      • $\mathrm{argmax}_e P(f|e)\,P(e)$
      • $P(e)$: language model
        • learns how to write fluent English text
      • $P(f|e)$: translation model
        • learns how to translate words and phrases from English to French
    • how to learn LM and TM

      • Language model:
        • Text statistics in large monolingual corpora (n-gram models)
      • Translation model:
        • Word co-occurrences in parallel corpora
        • i.e. English-French sentence pairs
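
As a toy illustration of the noisy-channel decomposition above, the sketch below scores candidate English sentences as $\log P(f|e) + \log P(e)$, combining a bigram language model with a word-level translation table. All probabilities, the tiny vocabulary, and the word-for-word alignment assumption are invented for illustration, not learned from data.

```python
import math

# Toy noisy-channel scorer: score(e) = log P(f|e) + log P(e).
# All probabilities below are made-up illustrative values.

# Language model P(e): bigram probabilities from monolingual text
lm = {("<s>", "the"): 0.4, ("the", "house"): 0.3, ("the", "home"): 0.1,
      ("house", "</s>"): 0.5, ("home", "</s>"): 0.5}

# Translation model P(f|e): word translation probabilities from parallel text
tm = {("maison", "house"): 0.8, ("maison", "home"): 0.2, ("la", "the"): 0.9}

def score(french, english):
    """log P(f|e) + log P(e) under a naive word-for-word assumption."""
    log_p = 0.0
    for f, e in zip(french, english):                  # translation model term
        log_p += math.log(tm.get((f, e), 1e-9))
    for prev, cur in zip(["<s>"] + english, english + ["</s>"]):  # LM term
        log_p += math.log(lm.get((prev, cur), 1e-9))
    return log_p

# The decoder picks the candidate e maximising the combined score
candidates = [["the", "house"], ["the", "home"]]
best = max(candidates, key=lambda e: score(["la", "maison"], e))
print(best)  # ['the', 'house']: the better LM score breaks the tie
```

In a real system both tables are estimated from large corpora, and decoding searches over vastly more candidate sentences.
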
    • parallel corpora

      • One text in multiple languages
      • Produced by human translation
        • Bible, news articles, legal transcripts, literature, subtitles
        • Open parallel corpus: http://opus.nlpl.eu/
    • models of translation

      • How do we learn $P(f|e)$ from parallel text?
      • We only have sentence pairs; words are not aligned in the parallel text
      • I.e., we don’t have word-to-word translations
    • alignment

      • Idea: introduce word alignment as a latent variable into the model

        • $P(f,a|e)$
      • Use algorithms such as expectation maximisation (EM) to learn it (e.g. GIZA++); a toy EM sketch follows this list

      • complexity

        • Some words are dropped and have no alignment

        • One-to-many alignment

        • Many-to-one alignment

        • Many-to-many alignment
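
To make the latent-alignment idea concrete, here is a minimal EM sketch in the style of IBM Model 1, run on two toy sentence pairs. Real aligners such as GIZA++ implement far richer models (fertility, distortion, and so on), so treat this purely as an illustration of how a translation table can be learned without ever observing word alignments.

```python
from collections import defaultdict

# Toy EM for a Model-1-style translation table t(f|e).
# Alignments are latent: we never observe which English word produced
# which French word, so EM fills in expected (soft) alignments.
pairs = [(["la", "maison"], ["the", "house"]),
         (["la", "fleur"], ["the", "flower"])]

f_vocab = {f for fs, _ in pairs for f in fs}
t = defaultdict(lambda: 1.0 / len(f_vocab))    # uniform initialisation

for _ in range(10):
    count = defaultdict(float)                 # expected counts c(f, e)
    total = defaultdict(float)                 # expected counts c(e)
    for fs, es in pairs:
        for f in fs:
            # E-step: spread each French word's mass over all English
            # words in the pair, proportional to the current t(f|e)
            z = sum(t[(f, e)] for e in es)
            for e in es:
                p = t[(f, e)] / z
                count[(f, e)] += p
                total[e] += p
    # M-step: re-estimate t(f|e) from the expected counts
    for (f, e), c in count.items():
        t[(f, e)] = c / total[e]

print(round(t[("maison", "house")], 3))  # close to 1.0 after a few iterations
```

Because “la” co-occurs with “the” in both pairs, EM gradually pushes its probability mass there, which in turn disambiguates “maison” → “house”.
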

    • summary

      • A very popular field of research in NLP prior to 2010s
      • Lots of feature engineering
      • State-of-the-art systems are very complex
        • Difficult to maintain
        • Significant effort needed for new language pairs

Neural MT

  • introduction

    • Neural machine translation (NMT) is a newer approach to machine translation
    • Uses a single neural model to translate directly from source to target
    • From a modelling perspective, much simpler
    • From an architecture perspective, easier to maintain
    • Requires parallel text
    • Architecture: encoder-decoder model
      • 1st RNN to encode the source sentence
      • 2nd RNN to decode the target sentence
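
A minimal PyTorch sketch of this encoder-decoder setup, assuming GRUs and illustrative layer and vocabulary sizes (none of these specifics come from the lecture):

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal encoder-decoder: one RNN encodes the source,
    a second RNN decodes the target from the encoder's final state."""
    def __init__(self, src_vocab, tgt_vocab, dim=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, tgt_vocab)    # projects to target-vocab logits

    def forward(self, src, tgt):
        # Encode: the final hidden state summarises the source sentence
        _, h = self.encoder(self.src_emb(src))
        # Decode: condition the target RNN on that single vector
        dec_states, _ = self.decoder(self.tgt_emb(tgt), h)
        return self.out(dec_states)             # logits for each target position

model = Seq2Seq(src_vocab=8000, tgt_vocab=8000)
src = torch.randint(0, 8000, (2, 7))            # batch of 2 source sentences
tgt = torch.randint(0, 8000, (2, 5))
print(model(src, tgt).shape)                    # torch.Size([2, 5, 8000])
```
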
  • neural MT

    • The decoder RNN can be interpreted as a conditional language model

      • Language model: predicts the next word given previous words in target sentence y
      • Conditional: prediction is also conditioned on the source sentence x
    • $P(y|x) = P(y_1|x)\,P(y_2|y_1,x)\cdots P(y_t|y_1,\ldots,y_{t-1},x)$

    • training

      • Requires parallel corpus just like statistical MT

      • Trains with next word prediction, just like a language model

      • Loss: the negative log-likelihood of the target sentence, $L = -\sum_t \log P(y_t|y_1,\ldots,y_{t-1},x)$

        • During training, we have the target sentence
        • We can therefore feed the right word from target sentence, one step at a time
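
A sketch of one teacher-forcing training step, reusing the hypothetical `Seq2Seq` model, `src`, and `tgt` from the sketch above: the gold target words are fed to the decoder shifted right by one position, and the loss is the cross-entropy (negative log-likelihood) of predicting each next word.

```python
import torch.nn.functional as F

# Teacher forcing: feed the gold target words as decoder input, shifted
# right by one, and train with next-word prediction (cross-entropy = NLL).
def training_step(model, src, tgt):
    dec_input = tgt[:, :-1]      # gold words fed in at each step
    gold_next = tgt[:, 1:]       # the word the model must predict next
    logits = model(src, dec_input)
    # Average negative log-likelihood of the correct next words
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           gold_next.reshape(-1))

loss = training_step(model, src, tgt)
loss.backward()                  # gradients flow through both RNNs
```
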
    • decoding at test time

      • But at test time, we don’t have the target sentence (that’s what we’re trying to predict!)

      • argmax: take the word with the highest probability at every step

      • exposure bias

        • Describes the discrepancy between training and testing
        • Training: the model always sees the ground-truth tokens at each step
        • Test: the model uses its own predictions at each step
        • Outcome: the model is unable to recover from its own errors (error propagation)
      • greedy decoding

        • argmax decoding is also called greedy decoding
        • Issue: does not guarantee the optimal probability $P(y|x)$
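
A minimal greedy decoding loop under the same assumptions as the earlier sketches; the `bos`/`eos` token ids are placeholders:

```python
import torch

@torch.no_grad()
def greedy_decode(model, src, bos=1, eos=2, max_len=50):
    """Pick the single most probable word at each step (k = 1 beam)."""
    ys = torch.tensor([[bos]])
    for _ in range(max_len):
        logits = model(src, ys)                 # re-run decoder on the prefix
        next_word = logits[0, -1].argmax()      # argmax over the vocabulary
        ys = torch.cat([ys, next_word.view(1, 1)], dim=1)
        if next_word.item() == eos:             # stop at end-of-sentence
            break
    return ys[0].tolist()

print(greedy_decode(model, src[:1]))  # token ids; junk until the model is trained
```
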
      • exhaustive search decoding

        • To find the optimal $P(y|x)$, we would need to consider every word at every step and compute the probability of all possible sequences
        • $O(V^n)$, where $V$ = vocab size and $n$ = sentence length
        • Far too expensive to be feasible
      • beam search decoding

        • Instead of considering all possible words at every step, consider k best words
        • That is, we keep track of the top-k words that produce the best partial translations (hypotheses) thus far
        • k = beam width (typically 5 to 10)
        • k = 1 = greedy decoding
        • k = V = exhaustive search decoding
        • Example: see the beam search sketch below
      • when to stop

        • When decoding, we stop when we generate an end-of-sentence token (e.g. `</s>`)
        • But multiple hypotheses may terminate their sentence at different time steps
        • We store hypotheses that have terminated, and continue exploring those that haven’t
        • Typically we also set a maximum sentence length that can be generated (e.g. 50 words)
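
Putting beam search and these stopping rules together, here is a sketch under the same assumptions as the greedy decoder above: each step expands every live hypothesis k ways, prunes back to the top k, sets aside hypotheses that emit the end token, and stops once all hypotheses have terminated or a maximum length is reached.

```python
import torch

@torch.no_grad()
def beam_search(model, src, k=5, bos=1, eos=2, max_len=50):
    """Keep the k best partial hypotheses (beams) at each step."""
    beams = [([bos], 0.0)]                      # (tokens, log-prob) pairs
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            logits = model(src, torch.tensor([tokens]))
            log_probs = logits[0, -1].log_softmax(-1)
            topv, topi = log_probs.topk(k)      # expand each beam k ways
            for lp, idx in zip(topv.tolist(), topi.tolist()):
                candidates.append((tokens + [idx], score + lp))
        # Prune back to the k best partial translations
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:k]:
            # Hypotheses that generate the end token are set aside as
            # finished; the rest continue to be expanded
            (finished if tokens[-1] == eos else beams).append((tokens, score))
        if not beams:                           # every hypothesis terminated
            break
    finished.extend(beams)                      # hit max_len without the end token
    # Length-normalise so short hypotheses aren't unfairly favoured
    return max(finished, key=lambda c: c[1] / len(c[0]))[0]

print(beam_search(model, src[:1], k=3))
```
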
    • issues of NMT

      • Information of the whole source sentence is represented by a single vector
      • NMT can generate new details not in source sentence
      • NMT tends to generate disfluent sentences (✗: output is usually fluent, which is in fact a strength)
      • Black-box model; difficult to explain when it doesn’t work
    • summary

      • Single end-to-end model
        • Statistical MT systems have multiple subcomponents
        • Less feature engineering
        • Can produce new details that are not in the source sentence (hallucination)

Attention Mechanism


  • With a long source sentence, the encoded vector is unlikely to capture all the information in the sentence
  • This creates an information bottleneck: a single fixed-size vector cannot capture all the information in a long sentence
  • attention
    • For the decoder, at every time step, allow it to ‘attend’ to words in the source sentence

    • encoder-decoder with attention

    • variants

      • attention
        • dot product: $s_t^T h_i$
        • bilinear: $s_t^T W h_i$
        • additive: $v^T \tanh(W_s s_t + W_h h_i)$
      • $c_t$ (the context vector: the attention-weighted sum of the encoder states) can be injected into the current state ($s_t$) or into the input word ($y_t$)
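
The three scoring variants written out in code, with random untrained weights purely to show the shapes involved; the last lines show how the scores become an attention distribution and the context vector $c_t$:

```python
import torch

d = 256
s_t = torch.randn(d)            # current decoder state
H = torch.randn(7, d)           # encoder states h_1..h_7

# Three common scoring functions for attention (weights here are
# random and untrained, purely to show the shapes involved)
W = torch.randn(d, d)
Ws, Wh, v = torch.randn(d, d), torch.randn(d, d), torch.randn(d)

dot      = H @ s_t                                   # s_t^T h_i
bilinear = H @ (W @ s_t)                             # s_t^T W h_i
additive = torch.tanh(s_t @ Ws.T + H @ Wh.T) @ v     # v^T tanh(W_s s_t + W_h h_i)

# Scores -> attention distribution -> context vector c_t
alpha = torch.softmax(dot, dim=0)                    # weights over source words
c_t = alpha @ H                                      # weighted sum of encoder states
print(alpha.shape, c_t.shape)                        # torch.Size([7]) torch.Size([256])
```
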
    • summary

      • Solves the information bottleneck issue by allowing the decoder direct access to the source sentence words; this also reduces hallucination somewhat, since the model is less likely to generate new words unrelated to the source sentence
      • Provides some form of interpretability (inspect the attention distribution to see which source words were attended to)
        • Attention weights can be seen as word alignments
      • Most state-of-the-art NMT systems use attention
        • Google Translate (https://slator.com/technology/google-facebook-amazonneural-machine-translation-just-had-its-busiest-month-ever/)

Evaluation

  • MT evaluation
    • BLEU: computes n-gram overlap between a “reference” translation (the ground truth) and the generated translation
    • Typically computed for 1- to 4-grams
      • $BLEU = BP \times \exp\left(\frac{1}{N}\sum_{n=1}^{N}\log p_n\right)$, where $BP$ is the “brevity penalty” that penalises short outputs
      • $p_n = \frac{\#\ \text{correct}\ n\text{-grams}}{\#\ \text{predicted}\ n\text{-grams}}$
      • $BP = \min\left(1, \frac{\text{output length}}{\text{reference length}}\right)$
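
A toy sentence-level implementation of these formulas (real BLEU is computed at the corpus level, and the $10^{-9}$ floor is an ad-hoc smoothing choice to avoid $\log 0$; the brevity penalty follows the lecture's simplified min form):

```python
import math
from collections import Counter

def bleu(pred, ref, N=4):
    """Toy sentence-level BLEU: geometric mean of clipped n-gram
    precisions p_1..p_N times the brevity penalty."""
    log_p = 0.0
    for n in range(1, N + 1):
        pred_ngrams = Counter(zip(*[pred[i:] for i in range(n)]))
        ref_ngrams = Counter(zip(*[ref[i:] for i in range(n)]))
        # Clip each predicted n-gram's count by its count in the reference
        correct = sum(min(c, ref_ngrams[g]) for g, c in pred_ngrams.items())
        total = max(sum(pred_ngrams.values()), 1)
        log_p += math.log(max(correct, 1e-9) / total)    # log p_n (smoothed)
    bp = min(1.0, len(pred) / len(ref))                  # brevity penalty
    return bp * math.exp(log_p / N)

pred = "the cat sat on the mat".split()
ref  = "the cat is on the mat".split()
print(round(bleu(pred, ref), 3))
```
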

Conclusion

  • Statistical MT
  • Neural MT
    • Nowadays use Transformers rather than RNNs
  • Encoder-decoder with attention architecture is a general architecture that can be used for other tasks
    • Summarisation (lecture 21)
    • Other generation tasks such as dialogue generation
