论文学习-Bert 和GPT 有什么区别?

news2024/12/29 9:49:18

Foundation Models, Transformers, BERT and GPT

总结一下:

  • Bert 是学习向量表征,让句子中某个词的Embedding关联到句子中其他重要词。最终学习下来,就是词向量的表征。这也是为什么Bert很容易用到下游任务,在做下游任务的时候,需要增加一些MLP对这些特征进行分类啥的,也就是所谓的微调fine-tune。在Bert的训练中,采用了MASK(完形填空)的思想,用句子中的其他词去预测被挖空的词–Self-Supervised Learning(不需要给句子label,只需要挖空)。这也是Bert不需要Decoder的原因。

  • GPT在做生成,结果是下一个特定词被选中的概率。给一个句子,去生成下一个字,然后再把这个字包含到句子中,重新送入模型,再生成下一个字。周而复始。我能理解这个任务用Decoder可以完成,但为什么这个过程不加入encoder了。–后面看到之后再补充

Bert和GPT都属于预训练模型,在预训练阶段,只不过在目标函数的选取上,Bert采用了完型填空的训练方式,GPT选择的是给定一句话预测下一个字的训练方式。在微调阶段,GPT选择使用两个目标函数结合的方式进行微调,而Bert的话,需要结合任务添加一些层对语义特征进行处理。

GPT选择的方式相对Bert要更困难,预测未来比预测中间状态要难得多,这也是为什么OpenAI要将模型的规模一直做大,才能达到GPT3.5 、GPT4的这种效果。


补充
之前李沐老师的视频里面其实也有讲,但是没记住。论带着问题学习的重要性 -_-

Transformer有两个东西,一个是encoder、一个是decoder。区别在于,encoder在对第i个元素抽取特征时,可以看到整个序列里面的所有元素。而decoder因为有掩码的存在,在对第i个元素抽取特征时,只能看到当前元素和它之前的元素,当前位置后面的词通过一个掩码使得在计算注意力机制的时候变成0。因为是标准的语言模型,只对前预测。对第i个词进行预测的时候,不能看到之后的词。所以GPT(Generative Pre-Training)使用的只是decoder。


学习链接Blog—完全转载

  • https://heidloff.net/article/foundation-models-transformers-bert-and-gpt/
    在这里插入图片描述

Since I’m excited by the incredible capabilities which technologies like ChatGPT and Bard provide, I’m trying to understand better how they work. This post summarizes my current understanding about foundation models, transformers, BERT and GPT.

Note that I’m only learning these concepts and not everything might be fully correct but might help some people to understand the high level concepts.

I know that there are many more and more modern Foundation Models than BERT and GPT, but I want to start ‘simple’ and these two models are probably the most known ones these days.

The technologies below are not trivial and there are lots of articles, papers and full courses even on certain aspects of each technology only. Instead of going into detail, I try to explain what they do and what concepts they use.

Foundation Models

BERT and GPT are both foundation models. Let’s look at the definition and characteristics:

  • Pre-trained on different types of unlabeled datasets (e.g., language and images)
  • Self-supervised learning
  • Generalized data representations which can be used in multiple downstream tasks (e.g., classification and generation)
  • The Transformer architecture is mostly used, but not mandatory

Read my blog Foundation Models at IBM to find out more.

Transformer Architecture

Most foundation models use the transformer architecture. Let’s look at the definition:

A transformer is a deep learning model that adopts the mechanism of self-attention, differentially weighting the significance of each part of the input data. It is used primarily in the fields of natural language processing and computer vision.

In 2017 transformers were introduced: Attention is all you need. They are the next generation of Recurrent Neural Networks and Long Short-Term Memory architectures and have several benefits:

  • Parallel processing: Increases performance and scalability
  • Bidirectionality: Allows understanding of ambiguous words and coreferences

The original transformer architecture defines two main parts, an encoder and a decoder. However, not all foundation models use both parts. BERT only uses encoders, GPT only decoders. More on this later.

Attention

Both encoders and decoders use the concept of ‘attention’. Attention basically means to focus on the important pieces of information and to blend out the unimportant pieces. I like to compare this with ‘fast reading’. Rather than reading full articles or even full books, I often browse chapter titles, first words of paragraphs and scan paragraphs for keywords to find what I’m looking for.

The words of an article, the parts of an image or the words in a sentence that should get most attention change dependent on what you are looking for. Let’s look at a simple example sentence:

“Sarah went to a restaurant to meet her friend that night.”

The following words should get attention for the following queries:

  • What? -> ‘went’, ‘meet’
  • Where? -> ‘a restaurant’
  • Who? -> ‘Sarah’, ‘her friend’
  • When? -> ‘that night’

To determine the attention of words (more exactly tokens) ‘queries’, ‘keys’ and ‘values’ are used by encoders and decoders in transformers. All of them are presented in vectors. Keys are found for certain queries if they are closest to the query vector. Keys are an encoded representation for values, in simple cases they can be the same.

There are different algorithms to implement the attention concept. I think an easy way to understand how this can work is to rank words high that are often used together in sentences. For example, ‘where’ and ‘restaurant’ have probably a closer relation than ‘restaurant’ and ‘faith’. So, for the query ‘where’ the word ‘restaurant’ gets more attention.

Encoders and Decoders

As mentioned, there are encoders and decoders. BERT uses encoders only, GTP uses decoders only. Both options understand language including syntax and semantics. Especially the next generation of large language models like GPT with billions of parameters do this very well.

The two models focus on different scenarios. However, since the field of foundation models is evolving, the differentiation is often fuzzier.

  • BERT (encoder): classification (e.g., sentiment), questions and answers, summarization, named entity recognition
  • GPT (decoder): translation, generation (e.g., stories)

The outputs of the core models are different:

  • BERT (encoder): Embeddings representing words with attention information in a certain context
  • GPT (decoder): Next words with probabilities

Both models are pretrained and can be reused without intensive training. Some of them are available as open source and can be downloaded from communities like Hugging Face, others are commercial. Reuse is important, since trainings are often very resource intensive and expensive which few companies can afford.

The pretrained models can be extended and customized for different domains and specific tasks. Layers can sometimes be reused without modifications and more layers are added on top. If layers need to be modified, the new training is more expensive. The technique to customize these models is called Transfer Learning, since the same generic model can easily be transferred to other domains.

BERT - Encoders

BERT uses the encoder part of the transformer architecture so that it understands semantic and syntactic language information. The output of BERT are embeddings, not predicted next words. To leverage these embeddings, other layer(s) need to be added on top, for example text classification or questions and answers.

BERT uses a genius trick for the training. For supervised training it is often expensive to get labeled data, sometimes it’s impossible. The trick is to use masks as I described in my post Evolution of AI explained via a simple Sample. Let’s take a simple example, an unlabeled sentence:

“Sarah went to a restaurant to meet her friend that night.”

This is converted into:

  • Text: “Sarah went to a restaurant to meet her MASK that night.”
  • Label: “Sarah went to a restaurant to meet her friend that night.”

Note that this is a very simplified description only since there aren’t ‘real’ labels in BERT.

In other words, BERT produces labeled data for originally unlabeled data. This technique is called Self-Supervised Learning. It works very well for huge amounts of data.

In masked language models like BERT, each masked word (token) prediction is conditioned on the rest of the tokens in the sentence. These are received in the encoder which is why you don’t need a decoder.

GPT - Decoders

In language scenarios decoders are used to generate next words, for example when translating text or generating stories. The outputs are words with probabilities.

Decoders also use the attention concepts and even two times. First when training models, they use Masked Multi-Head Attention which means that only the first words of the target sentence are provided so that the model can learn without cheating. This mechanism is like the MASK concept from BERT.

After this the decoder uses Multi-Head Attention as it’s also used in the encoder. Transformer based models that utilize encoders and decoders use a trick to be more efficient. The output of the encoders is feed as input to the decoders, more precisely the keys and values. Decoders can invoke queries to find the closest keys. This allows, for example, to understand the meaning of the original sentence and translate it into other languages even if the number of resulting words and the order changes.

GPT doesn’t use this trick though and only use a decoder. This is possible since these types of models have been trained with massive amounts of data (Large Language Model). The knowledge of encoders is encoded in billions of parameters (also called weights). The same knowledge exists in decoders when trained with enough data.

Note that ChatGPT has evolved these techniques. To prevent hate, profanity and abuse, humans need to label some data first. Additionally Reinforcement Learning is applied to improve the quality of the model (see ChatGPT: Optimizing Language Models for Dialogue).

Resources

There are many good articles, videos and courses. Here are some of the ones I read or watched:

  • Course: Natural Language Processing Demystified
  • YouTube channel: CodeEmporium
  • Article: What Is ChatGPT Doing … and Why Does It Work?
  • Article: 10 Things You Need to Know About BERT and the Transformer Architecture That Are Reshaping the AI Landscape
  • Article: Transformer’s Encoder-Decoder: Let’s Understand The Model Architecture
  • NLP - BERT & Transformer

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.coloradmin.cn/o/1273529.html

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈,一经查实,立即删除!

相关文章

Windows核心编程 HOOK

目录 HOOK概述 HOOK API SetWindowsHookExA 函数(winuser.h) UnhookWindowsHookEx 函数(winuser.h) NextHookEx 函数(winuser.h) 局部钩子 全局钩子 为什么全局钩子需要用dll作为过程函数? HOOK概述 本质:Windows消系统的消息过滤器。 全局钩子…

如何提高3D建模技能?

无论是制作影视动画还是视频游戏,提高3D建模技能对于你的工作都至关重要的。那么如何能创建出精美的3D模型呢?本文给大家一些3D建模技能方面的建议。 3D建模通过专门的软件完成,涉及制作三维对象。这项技能在视频游戏开发、建筑、动画和产品…

电商物流信息查询难?一招解决

在当今的电商时代,物流信息查询一直是电商行业的痛点。很多电商商家在处理大量快递订单时,经常需要手动一个个查询快递信息,不仅耗时而且耗力。为了解决这个问题,我们可以使用固乔快递查询助手,它可以帮助我们快速批量…

redis单机版本安装

redis单机版本安装 1.redis单机版源码编译安装搭建(4.0示例) redis下载地址 https://redis.io/download redis源码编译 #!/bin/sh yum install -y wget gcc gcc-c make tar openssl openssl-devel cmakecd /usr/local/src wget http://download.redis.io/releases/redis-4…

什么是Anaconda?作用是?使用python必须要安装嘛?

一、什么是Anaconda以及其作用? 通俗来讲,Anaconda算是一个环境容器,也可以叫环境管理器。 作用:可以在Anaconda容器中为python项目创建不同的环境。在各个不同环境中可以安装不同版本的包并且各个环境互不影响。可以在使用不同项…

Maven的安装和使用

Maven是一个基于项目对象模型(POM),可以管理项目构建、依赖管理、项目报告等的工具,使构建Java项目更容易。可以说Maven是一个项目管理和构建工具,它可以从管理项目的角度出发,将开发过程中的需求纳入进来&…

微信小程序 slider 翻转最大和最小值

微信小程序 slider 翻转最大和最小值 场景代码示例index.wxmlindex.jsutil.js 参考资料 场景 我想使用 slider 时最左边是 10 最右是 -10。 但是想当然的直接改成<slider min"10" max"-10" step"1" /> 并没用。 查了文档和社区也没有现成…

<Linux>(极简关键、省时省力)《Linux操作系统原理分析之存储管理(3)》(16)

《Linux操作系统原理分析之存储管理&#xff08;3&#xff09;》&#xff08;16&#xff09; 5 存储管理5.6 分段存储管理&#xff08;1&#xff09; 地址结构&#xff08;2&#xff09; 段表&#xff08;3&#xff09; 地址变换机构&#xff08;4&#xff09; 存储共享&#x…

自动化测试理论

一、初识自动化测试 概念 软件测试的定义 在规定的条件下对程序进行操作&#xff0c;以发现程序错误&#xff0c;衡量软件质量&#xff0c;并对其是否能满足设计要求进行评估的过程。 自动化测试的定义 把人对软件的测试行为转化为由机器执行测试行为的一种实践&#xff0…

芯片技术探索:了解构芯片的设计与制造之旅

芯片技术探索:了解构芯片的设计与制造之旅 一、引言 随着现代科技的飞速发展,芯片作为信息技术的核心,已经渗透到我们生活的方方面面。从智能手机、电视、汽车到医疗设备和工业控制系统,芯片在各个领域都发挥着至关重要的作用。然而,对于大多数人来说,芯片仍然是一个神秘…

【Flink进阶】-- Flink kubernetes operator 快速入门与实战

1、课程目录 2、课程链接 https://edu.csdn.net/course/detail/38831

发朋友圈的重要性和黄金时间

一、为什么发朋友圈要选择时间发&#xff1f; 1. 增加曝光率&#xff1a;通过定时自动发朋友圈&#xff0c;可以让更多的人看到你的动态。 2. 提高互动率&#xff1a;定时自动发朋友圈可以保持你的朋友圈活跃度&#xff0c;提高与粉丝的互动率。 3. 增强品牌形象&#xff1a…

系统设计面试指南之分布式任务调度

1 简介 任务是需要资源(CPU 时间、内存、存储、网络带宽等)在指定时间内完成的一段计算工作。 通过智能地将资源分配给任务以满足任务级和系统级目标的系统称为任务调度程序。 任务调度程序&#xff1a; 及时决定和分配资源给任务的过程称为任务调度。 当我们在 Facebook 发…

Golang语言基础之切片

概述 数组的长度是固定的并且数组长度属于类型的一部分&#xff0c;所以数组有很多的局限性 func arraySum(x [3]int) int{sum : 0for _, v : range x{sum sum v}return sum } 这个求和函数只能接受 [3]int 类型&#xff0c;其他的都不支持。 切片 切片&#xff08;Slic…

2023长三角(芜湖)人工智能数字生态峰会暨视觉算法大赛颁奖典礼邀您一同参与!

时光荏苒&#xff0c;科技飞速发展&#xff0c;人工智能正在以令人瞩目的速度改变着我们的生活。在这个数字科技的新时代&#xff0c;为促进长三角地区人工智能产业的蓬勃发展&#xff0c;助推数字经济的繁荣&#xff0c;2023长三角&#xff08;芜湖&#xff09;人工智能数字生…

php请求okx接口获取比特币价格数据、k线数据

php请求okx接口获取比特币价格数据 环境配置请求头、签名设置签名配置代理 全部代码 环境 我本地用的是thinkphp框架和guzzle 安装guzzle composer require guzzlehttp/guzzle 配置请求头、签名 我们需要准备api_key&#xff0c;secret_key&#xff0c;passphrase api_key…

软件测试的测试文档怎么编写?

在软件测试中的流程中&#xff0c;测试文档也是一个重要的流程&#xff0c;所以测试人员也需要学习测试文档的编写和阅读。 一定义&#xff1a; 测试文档&#xff08;TestingDocumentation&#xff09;记录和描述了整个测试流程&#xff0c;它是整个测试活动中非常重要的文件。…

预约系统源码解析:打造智能定制化预约服务的技术奇迹

在当今数字化时代&#xff0c;预约系统的重要性日益凸显&#xff0c;而预约系统源码的开放将为各行业带来更加灵活、智能的预约解决方案。本文将深入探讨预约系统源码的技术内幕&#xff0c;为开发者提供实用的代码示例&#xff0c;助力打造智能定制化的预约服务。 技术栈概览…

泛域名SSL证书是什么?泛域名SSL证书价格多少钱?

泛域名SSL证书是一种SSL证书类型&#xff0c;也被称为通配符SSL证书。SSL证书是保护网站数据传输安全及服务器身份可信的数字证书产品&#xff0c;通常绑定域名或IP&#xff0c;配置到网站服务器上。SSL证书根据保护域名数量及域名类型的不同&#xff0c;可以分为单域名SSL证书…

分布式系统:CAP 定理

欢迎来到分布式系统系列。在本文中&#xff0c;我们将学习并理解什么是 CAP 定理。CAP 代表一致性、可用性和分区容错性。当我们谈论CAP定理时&#xff0c;我们主要谈论的是分布式系统。首先&#xff0c;让我们了解一下什么是分布式系统。分布式系统是由运行在单台或多台机器上…