transformers-Causal language modeling

news2025/4/27 17:44:20

https://huggingface.co/docs/transformers/main/en/tasks/language_modeling

Causal language models are commonly used for text generation. The model predicts the next token in a sequence of tokens, and it can attend only to the tokens on the left; it cannot see the tokens on the right.
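The left-only attention described above is usually enforced with a lower-triangular mask. Below is a minimal sketch in plain Python; the helper name `causal_mask` is made up for illustration and is not part of the transformers API.

```python
def causal_mask(seq_len):
    """Build a seq_len x seq_len matrix; entry [i][j] is True
    when position i is allowed to attend to position j."""
    # Position i may look at itself and everything to its left (j <= i).
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


mask = causal_mask(4)
for row in mask:
    # "x" marks a visible token, "." marks a hidden (future) token.
    print(" ".join("x" if allowed else "." for allowed in row))
```

Each row corresponds to one position in the sequence: the first token sees only itself, while the last token sees the whole prefix.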

from datasets import load_dataset

eli5 = load_dataset("eli5", split="train_asks[:5000]")
eli5 = eli5.train_test_split(test_size=0.2)
eli5["train"][0]
{'answers': {'a_id': ['c3d1aib', 'c3d4lya'],
  'score': [6, 3],
  'text': ["The velocity needed to remain in orbit is equal to the square root of Newton's constant times the mass of earth divided by the distance from the center of the earth. I don't know the altitude of that specific mission, but they're usually around 300 km. That means he's going 7-8 km/s.\n\nIn space there are no other forces acting on either the shuttle or the guy, so they stay in the same position relative to each other. If he were to become unable to return to the ship, he would presumably run out of oxygen, or slowly fall into the atmosphere and burn up.",
   "Hope you don't mind me asking another question, but why aren't there any stars visible in this photo?"]},
 'answers_urls': {'url': []},
 'document': '',
 'q_id': 'nyxfp',
 'selftext': '_URL_0_\n\nThis was on the front page earlier and I have a few questions about it. Is it possible to calculate how fast the astronaut would be orbiting the earth? Also how does he stay close to the shuttle so that he can return safely, i.e is he orbiting at the same speed and can therefore stay next to it? And finally if his propulsion system failed, would he eventually re-enter the atmosphere and presumably die?',
 'selftext_urls': {'url': ['http://apod.nasa.gov/apod/image/1201/freeflyer_nasa_3000.jpg']},
 'subreddit': 'askscience',
 'title': 'Few questions about this space walk photograph.',
 'title_urls': {'url': []}}

1.Preprocess

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
eli5 = eli5.flatten()
eli5["train"][0]
{'answers.a_id': ['c3d1aib', 'c3d4lya'],
 'answers.score': [6, 3],
 'answers.text': ["The velocity needed to remain in orbit is equal to the square root of Newton's constant times the mass of earth divided by the distance from the center of the earth. I don't know the altitude of that specific mission, but they're usually around 300 km. That means he's going 7-8 km/s.\n\nIn space there are no other forces acting on either the shuttle or the guy, so they stay in the same position relative to each other. If he were to become unable to return to the ship, he would presumably run out of oxygen, or slowly fall into the atmosphere and burn up.",
  "Hope you don't mind me asking another question, but why aren't there any stars visible in this photo?"],
 'answers_urls.url': [],
 'document': '',
 'q_id': 'nyxfp',
 'selftext': '_URL_0_\n\nThis was on the front page earlier and I have a few questions about it. Is it possible to calculate how fast the astronaut would be orbiting the earth? Also how does he stay close to the shuttle so that he can return safely, i.e is he orbiting at the same speed and can therefore stay next to it? And finally if his propulsion system failed, would he eventually re-enter the atmosphere and presumably die?',
 'selftext_urls.url': ['http://apod.nasa.gov/apod/image/1201/freeflyer_nasa_3000.jpg'],
 'subreddit': 'askscience',
 'title': 'Few questions about this space walk photograph.',
 'title_urls.url': []}
def preprocess_function(examples):
    return tokenizer([" ".join(x) for x in examples["answers.text"]])

tokenized_eli5 = eli5.map(
    preprocess_function,
    batched=True,
    num_proc=4,
    remove_columns=eli5["train"].column_names,
)
block_size = 128


def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can
    # customize this part to your needs.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    # Split by chunks of block_size.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

lm_dataset = tokenized_eli5.map(group_texts, batched=True, num_proc=4)
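To see what the grouping step does, it can help to run the same logic on a toy batch. The sketch below copies the body of `group_texts` into a hypothetical `group_texts_demo` that takes `block_size` as a parameter so the example is self-contained; the token ids are invented.

```python
def group_texts_demo(examples, block_size):
    # Same logic as group_texts, with block_size passed in explicitly.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the remainder that does not fill a whole block.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # For causal LM the labels start out as a copy of the inputs.
    result["labels"] = result["input_ids"].copy()
    return result


# Three short "tokenized" examples: 10 tokens in total.
toy_batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1], [1, 1, 1]],
}
chunks = group_texts_demo(toy_batch, block_size=4)
print(chunks["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
print(chunks["labels"])     # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

With 10 tokens and a block size of 4, two full blocks are kept and the last two tokens (9 and 10) are dropped. Block boundaries ignore the original example boundaries, so a block can span two answers.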

from transformers import DataCollatorForLanguageModeling

tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
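With `mlm=False` the collator prepares labels for next-token prediction: the labels are a copy of the inputs, padding positions are set to -100 so the loss ignores them, and the shift by one position happens inside the model. The following is a simplified plain-Python sketch of that idea, not the library's implementation; `collate_clm` and `next_token_pairs` are hypothetical helpers and the token ids are invented.

```python
IGNORE_INDEX = -100  # label value that the loss skips


def collate_clm(batch, pad_token_id):
    """Right-pad a batch of token-id lists and build causal-LM labels."""
    max_len = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_token_id] * (max_len - len(ids)) for ids in batch]
    # Labels copy the inputs; padding positions are masked out.
    labels = [
        [IGNORE_INDEX if tok == pad_token_id else tok for tok in row]
        for row in input_ids
    ]
    return {"input_ids": input_ids, "labels": labels}


def next_token_pairs(input_ids, labels):
    """List (context, target) pairs: the prefix up to position t predicts token t + 1."""
    return [
        (input_ids[: t + 1], labels[t + 1])
        for t in range(len(input_ids) - 1)
        if labels[t + 1] != IGNORE_INDEX
    ]


batch = collate_clm([[5, 6, 7], [8, 9]], pad_token_id=0)
print(batch["labels"])  # [[5, 6, 7], [8, 9, -100]]
print(next_token_pairs(batch["input_ids"][0], batch["labels"][0]))
# [([5], 6), ([5, 6], 7)]
```

Because the pad token is set to the EOS token in the code above, any position holding that id is masked in the labels in this sketch, whether it is padding or a genuine end-of-sequence marker.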

2.Train

from transformers import AutoModelForCausalLM, TrainingArguments, Trainer

model = AutoModelForCausalLM.from_pretrained("distilgpt2")

training_args = TrainingArguments(
    output_dir="my_awesome_eli5_clm-model",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    push_to_hub=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_dataset["train"],
    eval_dataset=lm_dataset["test"],
    data_collator=data_collator,
)

trainer.train()

3.Inference

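from transformers import AutoModelForCausalLM, AutoTokenizer

# Example prompt (an assumption; the original snippet left `prompt` undefined).
prompt = "Somatic hypermutation allows the immune system to"

tokenizer = AutoTokenizer.from_pretrained("my_awesome_eli5_clm-model")
inputs = tokenizer(prompt, return_tensors="pt").input_ids

# Load the fine-tuned model and sample a continuation; these generation settings are illustrative.
model = AutoModelForCausalLM.from_pretrained("my_awesome_eli5_clm-model")
outputs = model.generate(inputs, max_new_tokens=100, do_sample=True, top_k=50, top_p=0.95)
tokenizer.batch_decode(outputs, skip_special_tokens=True)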
