Building Large Language Model Applications with Hugging Face

news2024/12/28 5:22:58

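In this article, we show how to build several common applications with large language models (LLMs) from Hugging Face: summarization, sentiment analysis, translation, zero-shot classification, and few-shot learning. We explore existing open-source and proprietary models and show how to apply them directly to each use case. Along the way, we introduce basic prompt engineering and show how to configure LLM pipelines with the Hugging Face APIs.

Learning objectives:

  • Build common applications with a variety of existing models.

  • Understand basic prompt engineering.

  • Understand search and sampling methods in LLM inference.

  • Get familiar with the main Hugging Face abstractions: datasets, pipelines, tokenizers, and models.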

Environment setup

[Image: environment installation steps]

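Common LLM applications

This section gives an overview of several common LLM applications and shows how little code is needed to get started with LLMs.

As you go through the examples, note the datasets, models, APIs, and options used. These simple examples can serve as starting points for your own applications.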

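Summarization

Summarization comes in two forms:

  1. Extractive: select representative excerpts from the text to form the summary.

  2. Abstractive: generate new text to form the summary.

In this article, we use an abstractive summarization model.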
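For contrast with the abstractive model used in the rest of this section, the extractive approach can be sketched without any model at all. The snippet below is a toy illustration only: the function name `extractive_summary` and the word-frequency scoring rule are assumptions made for this example, not something taken from Hugging Face.

```python
import re
from collections import Counter


def extractive_summary(text: str, num_sentences: int = 1) -> str:
    """Return the highest-scoring sentences of `text`, kept in their original order."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Word frequencies over the whole text (lower-cased).
    word_counts = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence: str) -> float:
        words = re.findall(r"[a-z']+", sentence.lower())
        # Average frequency, so long sentences are not favoured automatically.
        return sum(word_counts[w] for w in words) / max(len(words), 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])  # restore document order
    return " ".join(sentences[i] for i in chosen)


example = "Floods hit the town. Floods damaged roads in the town. The mayor spoke."
print(extractive_summary(example))
```

Every sentence in the output is copied verbatim from the input, which is the defining property of extractive summarization. An abstractive model such as T5 is free to produce wording that never appears in the article.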

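Background reading: the Hugging Face summarization task page lists the model architectures that support summarization. The summarization chapter provides a detailed walkthrough.

In this section, we use:

  • Dataset: the xsum dataset, which provides a set of BBC news articles and corresponding summaries.

  • Model: the t5-small model, which has 60 million parameters (242 MB for PyTorch). T5 is an encoder-decoder model created by Google that supports several tasks, including summarization, translation, question answering, and text classification. For more details, see the Google blog post, the code on GitHub, or the research paper.

Loading the dataset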

`from datasets import load_dataset

xsum_dataset = load_dataset(
    "xsum", version="1.2.0", cache_dir="/root/home/LLMs/week1"
)  # Note: We specify cache_dir to use predownloaded data.
xsum_dataset  # The printed representation of this object shows the `num_rows` of each dataset split.

# Output
DatasetDict({    
    train: Dataset({           
        features: ['document', 'summary', 'id'],           
        num_rows: 204045       
    })       
    validation: Dataset({      
        features: ['document', 'summary', 'id'],           
        num_rows: 11332       
    })       
    test: Dataset({      
        features: ['document', 'summary', 'id'],           
        num_rows: 11334       
    })   
})   `

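The dataset provides three columns:

  • document: the text of the BBC article.

  • summary: a "ground-truth" summary. Note that a "ground-truth" summary is subjective and may differ from the one you would write. This is a good example of how many LLM applications have no single obviously "correct" answer.

  • id: a unique identifier for the article.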

xsum_sample = xsum_dataset["train"].select(range(10))   
display(xsum_sample.to_pandas())   

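Next, we use the Hugging Face pipeline tool to load a pre-trained model. In the constructor of the LLM pipeline, we specify the following arguments:

  • task: the first argument specifies the primary task. See the Hugging Face task documentation for more information.

  • model: the name of a pre-trained model to load from the Hugging Face Hub.

  • min_length, max_length: the minimum and maximum length, in tokens, of the generated summary.

  • truncation: some input articles may be too long for the LLM to process. Most LLMs have a fixed limit on the length of the input sequence. This option tells the pipeline to truncate the input when needed.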
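The arguments listed above could be combined roughly as follows. This is a sketch under assumptions rather than the exact code of the original post, which showed this step only as an image. The values `min_length=20` and `max_length=40`, and the helper name `build_summarizer`, are assumptions. The cache directory is the one used in the other snippets of this article.

```python
# Assumed settings for the summarization pipeline (see the note above).
SUMMARIZER_KWARGS = {
    "task": "summarization",
    "model": "t5-small",
    "min_length": 20,    # assumed lower bound on summary length, in tokens
    "max_length": 40,    # assumed upper bound on summary length, in tokens
    "truncation": True,  # truncate articles that exceed the model's input limit
    "model_kwargs": {"cache_dir": "/root/home/LLMs/week1"},
}


def build_summarizer(**overrides):
    """Create a summarization pipeline from SUMMARIZER_KWARGS plus any overrides."""
    # Imported lazily so that defining the configuration does not require transformers.
    from transformers import pipeline

    return pipeline(**{**SUMMARIZER_KWARGS, **overrides})


# summarizer = build_summarizer()  # downloads t5-small on first use
```

With `summarizer` built this way, the calls that follow (`summarizer(xsum_sample["document"][0])` and the batched version) should run as written.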


# Apply to 1 article   
summarizer(xsum_sample["document"][0])      

# Apply to a batch of articles   
results = summarizer(xsum_sample["document"])      

# Display the generated summary side-by-side with the reference summary and original document.   
# We use Pandas to join the inputs and outputs together in a nice format.   
import pandas as pd       

display(    
    pd.DataFrame.from_dict(results)       
    .rename({"summary_text": "generated_summary"}, axis=1)       
    .join(pd.DataFrame.from_dict(xsum_sample))[       
        ["generated_summary", "summary", "document"]       
    ]   
)      

# Output
results   
[{'summary_text': 'the full cost of damage in Newton Stewart is still being assessed . many roads in peeblesshire remain badly affected by standing water . a flood alert remains in place across the'},  
 {'summary_text': 'a fire alarm went off at the Holiday Inn in Hope Street on Saturday . guests were asked to leave the hotel . the two buses were parked side-by-side in'},    
 {'summary_text': 'Sebastian Vettel will start third ahead of team-mate Kimi Raikkonen . stewards only handed Hamilton a reprimand after governing body said "n'},    
 {'summary_text': 'the 67-year-old is accused of committing the offences between March 1972 and October 1989 . he denies all the charges, including two counts of indecency'},    
 {'summary_text': 'a man receiving psychiatric treatment at the clinic threatened to shoot himself and others . the incident comes amid tension in Istanbul following several attacks on the reina nightclub .'},    
 {'summary_text': 'Gregor Townsend gave a debut to powerhouse wing Taqele Naiyaravoro . the dragons gave first starts of the season to wing a'},    
 {'summary_text': 'Veronica Vanessa Chango-Alverez, 31, was killed and another man injured in the crash . police want to trace Nathan Davis, 27, who has links to the Audi .'},    
 {'summary_text': 'the 25-year-old was hit by a motorbike during the Gent-Wevelgem race . he was riding for the Wanty-Gobert team and was taken'},    
 {'summary_text': 'gundogan will not be fit for the start of the premier league season at Brighton on 12 august . the 26-year-old says his recovery time is now being measured in "week'},    
 {'summary_text': 'the crash happened about 07:20 GMT at the junction of the A127 and Progress Road in leigh-on-Sea, Essex . the man, aged in his 20s'}]      

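Sentiment analysis

Sentiment analysis is a text classification task that estimates whether a piece of text is positive, negative, or some other sentiment. The exact set of sentiment labels varies from application to application.

Background reading: see the Hugging Face text classification task page or the Wikipedia page on sentiment analysis.

In this section, we use:

  • Dataset: poem_sentiment, which provides lines of poetry labeled with a sentiment of negative (0), positive (1), no_impact (2), or mixed (3).

  • Model: a fine-tuned BERT model. BERT (Bidirectional Encoder Representations from Transformers) is an encoder-only model developed by Google that can be used for up to 11 different tasks, including sentiment analysis and entity recognition. For more details, see the Hugging Face blog post or the Wikipedia page.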

We load the pipeline with the task text-classification, because our goal is to classify text.
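The loading code for this step was only shown as images in the original post, so the following is a hedged reconstruction. The checkpoint name in `ASSUMED_MODEL_NAME` is an assumption (one publicly shared BERT model fine-tuned on poem_sentiment), since the text only says "a fine-tuned BERT model". The helper names are hypothetical as well, and depending on your `datasets` version the dataset may need to be referenced by its full Hub path.

```python
DATASET_NAME = "poem_sentiment"
# Assumption: the article does not name the checkpoint; this is one candidate.
ASSUMED_MODEL_NAME = "nickwong64/bert-base-uncased-poems-sentiment"
CACHE_DIR = "/root/home/LLMs/week1"  # cache path used elsewhere in this article


def load_poem_sample(n: int = 10, cache_dir: str = CACHE_DIR):
    """Load the first n training examples of the poem_sentiment dataset."""
    from datasets import load_dataset  # lazy import: needs the datasets package

    poem_dataset = load_dataset(DATASET_NAME, cache_dir=cache_dir)
    return poem_dataset["train"].select(range(n))


def classify_verses(verses, model_name: str = ASSUMED_MODEL_NAME, cache_dir: str = CACHE_DIR):
    """Run a text-classification pipeline over a list of verses."""
    from transformers import pipeline  # lazy import: needs the transformers package

    sentiment_classifier = pipeline(
        task="text-classification",
        model=model_name,
        model_kwargs={"cache_dir": cache_dir},
    )
    # Each result is a dict with a predicted "label" and a confidence "score".
    return sentiment_classifier(verses)


# poem_sample = load_poem_sample()
# results = classify_verses(poem_sample["verse_text"])
```

The two commented lines at the end produce the `poem_sample` and `results` objects that the display code in this section joins together.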

# Display the predicted sentiment side-by-side with the ground-truth label and original text.   
# The score indicates the model's confidence in its prediction.  
     
# Join predictions with ground-truth data   
joined_data = (     
    pd.DataFrame.from_dict(results)       
    .rename({"label": "predicted_label"}, axis=1)       
    .join(pd.DataFrame.from_dict(poem_sample).rename({"label": "true_label"}, axis=1))   
)

# Change label indices to text labels
sentiment_labels = {0: "negative", 1: "positive", 2: "no_impact", 3: "mixed"}   
joined_data = joined_data.replace({"true_label": sentiment_labels})       

display(joined_data[["predicted_label", "true_label", "score", "verse_text"]])   


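Translation

Translation models can be designed for specific language pairs, or they can support many languages.

In this section, we use:

  • Dataset: a few hard-coded example sentences. Hugging Face also provides a variety of translation datasets.

  • Models:

    • Helsinki-NLP/opus-mt-en-es, used for the English ("en") to Spanish ("es") translation example. This model is based on Marian NMT, a neural machine translation framework developed by Microsoft and other researchers. See the GitHub page for links to code and related resources.

    • t5-small, which has 60 million parameters (242 MB for PyTorch). T5 is an encoder-decoder model created by Google that supports multiple tasks, including summarization, translation, question answering, and text classification.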

en_to_es_translation_pipeline = pipeline(    
   task="translation",       
   model="Helsinki-NLP/opus-mt-en-es",       
   model_kwargs={"cache_dir": "/root/home/LLMs/week1"},   
)      
en_to_es_translation_pipeline(       
"Existing, open-source (and proprietary) models can be used out-of-the-box for many applications."   
)      

# Output
[{'translation_text': 'Los modelos existentes, de código abierto (y propietario) se pueden utilizar fuera de la caja para muchas aplicaciones.'}]   

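Some models support translation between multiple languages. Below, we demonstrate this with t5-small. Because it supports multiple languages (and tasks), we explicitly instruct it to translate from one language to another.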
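Since T5 selects its task from a plain-text prefix, it can help to build that prefix in one place. The helper below is a small sketch. The function name `t5_translation_prompt` is hypothetical, and the list of target languages reflects the English-to-X prefixes documented for the original T5 checkpoints (German, French, Romanian).

```python
# Target languages with documented T5 translation prefixes (English source only).
T5_TARGET_LANGUAGES = ("German", "French", "Romanian")


def t5_translation_prompt(text: str, target_language: str) -> str:
    """Prepend the task prefix that tells T5 which translation to perform."""
    if target_language not in T5_TARGET_LANGUAGES:
        raise ValueError(
            f"Unsupported target language {target_language!r}; "
            f"expected one of {T5_TARGET_LANGUAGES}"
        )
    return f"translate English to {target_language}: {text}"


print(t5_translation_prompt("I love you.", "German"))
```

Validating the language name up front guards against misspelled prefixes, which the model does not reject and may simply echo back in its output.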

t5_small_pipeline = pipeline(
    task="text2text-generation",       
    model="t5-small",       
    max_length=50,       
    model_kwargs={"cache_dir": "/root/home/LLMs/week1"},   
)     

t5_small_pipeline(  
    "translate English to Romanian: Existing, open-source (and proprietary) models can be used out-of-the-box for many applications."   
)      

# Output
[{'generated_text': 'Modelele existente, deschise (şi proprietăţi) pot fi utilizate în afara legii pentru multe aplicaţii.'}]     
 
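# Note: the target language below is misspelled in the original run ("Desutchland");
# T5's documented prefix for German is "translate English to German: ".
t5_small_pipeline(
   "translate English to Desutchland: I love you."
)

# Output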
[{'generated_text': 'Desutchland: Ich liebe dich.'}]   

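Zero-shot classification

Zero-shot classification (or zero-shot learning) is the task of assigning a piece of text to one of several given categories or labels without having explicitly trained the model to predict those categories. The idea appeared in the literature before modern language models, but recent advances in language models have made zero-shot learning far more flexible and powerful.

In this section, we use:

  • Dataset: a few example articles from the xsum dataset used in the Summarization section above. Our goal is to label these news articles with one of several categories.

  • Model: nli-deberta-v3-small, a fine-tuned version of the DeBERTa model. The DeBERTa base model was developed by Microsoft and is one of several models derived from BERT.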
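It may help to see why a natural language inference (NLI) model can classify text into labels it was never trained on. As far as we understand the zero-shot pipeline, each candidate label is turned into a hypothesis sentence, the model scores how strongly the article entails it, and the scores are normalized across labels. The sketch below mimics only that last part with made-up numbers: the logits are hypothetical, and no model is called.

```python
import math


def build_hypotheses(labels, template="This example is {}."):
    """Turn each candidate label into an NLI hypothesis sentence."""
    return [template.format(label) for label in labels]


def zero_shot_scores(entailment_logits: dict) -> dict:
    """Softmax per-label entailment logits so that the scores sum to 1."""
    max_logit = max(entailment_logits.values())  # subtract max for numerical stability
    exps = {label: math.exp(v - max_logit) for label, v in entailment_logits.items()}
    total = sum(exps.values())
    ranked = sorted(exps.items(), key=lambda kv: kv[1], reverse=True)
    return {label: value / total for label, value in ranked}


labels = ["politics", "sports", "finance"]
print(build_hypotheses(labels))
# Hypothetical entailment logits, as if produced by an NLI model for one article.
print(zero_shot_scores({"politics": -1.0, "sports": 2.0, "finance": -0.5}))
```

Because the labels enter only through the hypothesis text, they can be changed at call time without retraining, which is what `candidate_labels` does in the pipeline call in this section.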


zero_shot_pipeline = pipeline(   
    task="zero-shot-classification",       
    model="cross-encoder/nli-deberta-v3-small",       
    model_kwargs={"cache_dir": "/root/home/LLMs/week1"},   
)       

def categorize_article(article: str) -> None:       
    """
    This helper function defines the categories (labels) which the model must use to label articles.       
    Note that our model was NOT fine-tuned to use these specific labels,       
    but it "knows" what the labels mean from its more general training.         
      
    This function then prints out the predicted labels alongside their confidence scores.       
    """
    results = zero_shot_pipeline(        
        article,           
        candidate_labels=[           
            "politics",               
            "finance",               
            "sports",               
            "science and technology",               
            "pop culture",               
            "breaking news",           
        ],       
    )       
    # Print the results nicely       
    del results["sequence"]       
    display(pd.DataFrame(results))


categorize_article(
    """   
Simone Favaro got the crucial try with the last move of the game, following earlier touchdowns by Chris Fusaro, Zander Fagerson and Junior Bulumakau.   
Rynard Landman and Ashton Hewitt got a try in either half for the Dragons.   
Glasgow showed far superior strength in depth as they took control of a messy match in the second period.   
Home coach Gregor Townsend gave a debut to powerhouse Fijian-born Wallaby wing Taqele Naiyaravoro, and centre Alex Dunbar returned from long-term injury, while the Dragons gave first starts of the season to wing Aled Brew and hooker Elliot Dee.   
Glasgow lost hooker Pat McArthur to an early shoulder injury but took advantage of their first pressure when Rory Clegg slotted over a penalty on 12 minutes.   
It took 24 minutes for a disjointed game to produce a try as Sarel Pretorius sniped from close range and Landman forced his way over for Jason Tovey to convert - although it was the lock's last contribution as he departed with a chest injury shortly afterwards.   
Glasgow struck back when Fusaro drove over from a rolling maul on 35 minutes for Clegg to convert.   
But the Dragons levelled at 10-10 before half-time when Naiyaravoro was yellow-carded for an aerial tackle on Brew and Tovey slotted the easy goal.   
The visitors could not make the most of their one-man advantage after the break as their error count cost them dearly.   
It was Glasgow's bench experience that showed when Mike Blair's break led to a short-range score from teenage prop Fagerson, converted by Clegg.   
Debutant Favaro was the second home player to be sin-binned, on 63 minutes, but again the Warriors made light of it as replacement wing Bulumakau, a recruit from the Army, pounced to deftly hack through a bouncing ball for an opportunist try.   
The Dragons got back within striking range with some excellent combined handling putting Hewitt over unopposed after 72 minutes.   
However, Favaro became sinner-turned-saint as he got on the end of another effective rolling maul to earn his side the extra point with the last move of the game, Clegg converting.   
Dragons director of rugby Lyn Jones said: "We're disappointed to have lost but our performance was a lot better [than against Leinster] and the game could have gone either way.   
"Unfortunately too many errors behind the scrum cost us a great deal, though from where we were a fortnight ago in Dublin our workrate and desire was excellent.   
"It was simply error count from individuals behind the scrum that cost us field position, it's not rocket science - they were correct in how they played and we had a few errors, that was the difference."   
Glasgow Warriors: Rory Hughes, Taqele Naiyaravoro, Alex Dunbar, Fraser Lyle, Lee Jones, Rory Clegg, Grayson Hart; Alex Allan, Pat MacArthur, Zander Fagerson, Rob Harley (capt), Scott Cummings, Hugh Blake, Chris Fusaro, Adam Ashe.   
Replacements: Fergus Scott, Jerry Yanuyanutawa, Mike Cusack, Greg Peterson, Simone Favaro, Mike Blair, Gregor Hunter, Junior Bulumakau.   
Dragons: Carl Meyer, Ashton Hewitt, Ross Wardle, Adam Warren, Aled Brew, Jason Tovey, Sarel Pretorius; Boris Stankovich, Elliot Dee, Brok Harris, Nick Crosswell, Rynard Landman (capt), Lewis Evans, Nic Cudd, Ed Jackson.   
Replacements: Rhys Buckley, Phil Price, Shaun Knight, Matthew Screech, Ollie Griffiths, Luc Jones, Charlie Davies, Nick Scott.   
"""   
)      

# Output
   labels                  scores
0  sports                  0.469011
1  breaking news           0.223165
2  science and technology  0.107025
3  pop culture             0.104471
4  politics                0.057390
5  finance                 0.03893

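Few-shot learning

In few-shot learning, we give the model an instruction and a few example query-response pairs, then ask it to generate a response to a new query. The technique is powerful because it lets one model be reused across a much wider range of applications. It is also challenging: getting good, reliable results can take a significant amount of prompt engineering.

In this section, we use:

  • Task: few-shot learning can be applied to many tasks. Here, we do sentiment analysis, which was introduced earlier. With few-shot learning, however, we can define custom labels instead of being limited to the specific label set a model was fine-tuned on. We also show other toy tasks. In Hugging Face's task taxonomy, few-shot learning is handled as a text generation task.

  • Data: a few example inputs, including sample tweets taken from a blog post.

  • Model: gpt-neo-1.3B, a version of the GPT-Neo model. GPT-Neo is a transformer model developed by EleutherAI; this version has 1.3 billion parameters. For more details, see the code on GitHub or the related research paper.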

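Tip: In the few-shot prompts below, we separate examples with the special token "###", and we use the same token to prompt the language model to end its output after answering the query. We tell the pipeline to treat that special token as the end-of-sequence (EOS) token.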
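The tip above can be sketched in code. The original post showed the pipeline setup only as images, so treat this as a reconstruction under assumptions: the helper names are hypothetical, and `max_new_tokens=10` is an assumed default. The prompt builder is plain Python and reproduces the "###"-separated layout used in this section's examples.

```python
SEPARATOR = "###"


def build_few_shot_prompt(instruction, examples, query,
                          query_field="word", answer_field="description"):
    """Assemble an instruction, solved examples, and an open query into one prompt."""
    blocks = [f"[{query_field}]: {q}\n[{answer_field}]: {a}" for q, a in examples]
    # The final block leaves the answer empty for the model to complete.
    blocks.append(f"[{query_field}]: {query}\n[{answer_field}]:")
    return instruction + "\n\n" + f"\n{SEPARATOR}\n".join(blocks)


def build_few_shot_pipeline(cache_dir="/root/home/LLMs/week1"):
    """Load GPT-Neo and look up the token id of the separator (assumed setup)."""
    from transformers import pipeline  # lazy import: needs the transformers package

    few_shot_pipeline = pipeline(
        task="text-generation",
        model="EleutherAI/gpt-neo-1.3B",
        max_new_tokens=10,  # assumed default; overridden in a later example
        model_kwargs={"cache_dir": cache_dir},
    )
    # Passing this id as eos_token_id stops generation at the next "###".
    eos_token_id = few_shot_pipeline.tokenizer.encode(SEPARATOR)[0]
    return few_shot_pipeline, eos_token_id


print(build_few_shot_prompt(
    "Given a word describing how someone is feeling, suggest a description of that person.",
    [("happy", "smiling, laughing, clapping")],
    "confused",
))
```

Using the same separator both between examples and as the stop token means the model tends to emit "###" once it has answered, which then ends generation.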


`# This example sometimes works and sometimes does not, when sampling.  Too abstract?   
results = few_shot_pipeline(   
    """Given a word describing how someone is feeling, suggest a description of that person.  The description should not include the original word.       

[word]: happy   
[description]: smiling, laughing, clapping   
###   
[word]: nervous   
[description]: glancing around quickly, sweating, fidgeting   
###   
[word]: sleepy
[description]: heavy-lidded, slumping, rubbing eyes
###
[word]: confused   
[description]:""",     
    eos_token_id=eos_token_id,   
)       

print(results[0]["generated_text"])      

# Output
Setting `pad_token_id` to `eos_token_id`:21017 for open-end generation.   
Given a word describing how someone is feeling, suggest a description of that person.  The description should not include the original word.       

[word]: happy   
[description]: smiling, laughing, clapping   
###   
[word]: nervous   
[description]: glancing around quickly, sweating, fidgeting   
###   
[word]: sleepy   
[description]: heavy-lidded, slumping, rubbing eyes   
###  
[word]: confused   
[description]: staring at one's own reflection in the water,   `

`# We override max_new_tokens to generate longer answers.
# These book descriptions were taken from their corresponding Wikipedia pages.   
results = few_shot_pipeline(   
    """Generate a book summary from the title:       
    
[book title]: "Stranger in a Strange Land"   
[book description]: "This novel tells the story of Valentine Michael Smith, a human who comes to Earth in early adulthood after being born on the planet Mars and raised by Martians, and explores his interaction with and eventual transformation of Terran culture."   
###   
[book title]: "The Adventures of Tom Sawyer"   
[book description]: "This novel is about a boy growing up along the Mississippi River. It is set in the 1840s in the town of St. Petersburg, which is based on Hannibal, Missouri, where Twain lived as a boy. In the novel, Tom Sawyer has several adventures, often with his friend Huckleberry Finn."   
###   
[book title]: "Dune"   
[book description]: "This novel is set in the distant future amidst a feudal interstellar society in which various noble houses control planetary fiefs. It tells the story of young Paul Atreides, whose family accepts the stewardship of the planet Arrakis. While the planet is an inhospitable and sparsely populated desert wasteland, it is the only source of melange, or spice, a drug that extends life and enhances mental abilities.  The story explores the multilayered interactions of politics, religion, ecology, technology, and human emotion, as the factions of the empire confront each other in a struggle for the control of Arrakis and its spice."   
###   
[book title]: "Blue Mars"   
[book description]:""",    
    eos_token_id=eos_token_id,       
    max_new_tokens=50,   
)       

print(results[0]["generated_text"])         

# Output
Setting `pad_token_id` to `eos_token_id`:21017 for open-end generation.   
Generate a book summary from the title:       

[book title]: "Stranger in a Strange Land"  
[book description]: "This novel tells the story of Valentine Michael Smith, a human who comes to Earth in early adulthood after being born on the planet Mars and raised by Martians, and explores his interaction with and eventual transformation of Terran culture."   
###   
[book title]: "The Adventures of Tom Sawyer"   
[book description]: "This novel is about a boy growing up along the Mississippi River. It is set in the 1840s in the town of St. Petersburg, which is based on Hannibal, Missouri, where Twain lived as a boy. In the novel, Tom Sawyer has several adventures, often with his friend Huckleberry Finn."   
###   
[book title]: "Dune"   
[book description]: "This novel is set in the distant future amidst a feudal interstellar society in which various noble houses control planetary fiefs. It tells the story of young Paul Atreides, whose family accepts the stewardship of the planet Arrakis. While the planet is an inhospitable and sparsely populated desert wasteland, it is the only source of melange, or spice, a drug that extends life and enhances mental abilities.  The story explores the multilayered interactions of politics, religion, ecology, technology, and human emotion, as the factions of the empire confront each other in a struggle for the control of Arrakis and its spice."   
###   
[book title]: "Blue Mars"  
[book description]: "This is a post-apocalyptic story about humanity, where the last survivors are forced to struggle to build a new world. A small group of survivors are forced to build a new civilization in order to survive and learn the lessons of the past."   `

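Prompt engineering plays a key role when interacting with language models. As we use more general and powerful models such as GPT-3.5, writing good prompts becomes increasingly important. A good prompt helps the model understand the user's intent and generate accurate, useful responses.

Hugging Face API

In this section, we look more closely at some details of the Hugging Face API:

  • Search and sampling for text generation

  • Auto* loaders

  • Model-specific tokenizers and model loaders

Recall the xsum dataset used in the Summarization section above: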

display(xsum_sample.to_pandas())      

# Output
  | document | summary | id
0 | The full cost of damage in Newton Stewart, one... | Clean-up operations are continuing across the ... | 35232142
1 | A fire alarm went off at the Holiday Inn in Ho... | Two tourist buses have been destroyed by fire ... | 40143035
2 | Ferrari appeared in a position to challenge un... | Lewis Hamilton stormed to pole position at the... | 35951548
3 | John Edward Bates, formerly of Spalding, Linco... | A former Lincolnshire Police officer carried o... | 36266422
4 | Patients and staff were evacuated from Cerahpa... | An armed man who locked himself into a room at... | 38826984
5 | Simone Favaro got the crucial try with the las... | Defending Pro12 champions Glasgow Warriors bag... | 34540833
6 | Veronica Vanessa Chango-Alverez, 31, was kille... | A man with links to a car that was involved in... | 20836172
7 | Belgian cyclist Demoitie died after a collisio... | Welsh cyclist Luke Rowe says changes to the sp... | 35932467
8 | Gundogan, 26, told BBC Sport he "can see the f... | Manchester City midfielder Ilkay Gundogan says... | 40758845
9 | The crash happened about 07:20 GMT at the junc... | A jogger has been hit by an unmarked police ca... | 30358490


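Search and sampling during inference

Hugging Face pipelines accept several inference-related parameters, such as num_beams and do_sample.

A language model works by predicting the next token, then the token after that, and so on. The goal is to generate a high-probability sequence of tokens, which amounts to a search over the space of possible sequences.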
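To make "search over the space of possible sequences" concrete, here is a toy comparison of greedy search (always take the single most likely next token) with beam search (keep several candidate sequences at each step). The probability table `TOY_MODEL` is entirely hypothetical, and real implementations add details such as end-of-sequence handling and length penalties that this sketch leaves out.

```python
import math

# Hypothetical next-token probabilities, keyed by the tokens generated so far.
TOY_MODEL = {
    ("The",): {"nice": 0.5, "dog": 0.4, "car": 0.1},
    ("The", "nice"): {"woman": 0.4, "house": 0.3, "guy": 0.3},
    ("The", "dog"): {"has": 0.9, "runs": 0.05, "and": 0.05},
    ("The", "car"): {"is": 0.3, "drives": 0.5, "turns": 0.2},
}


def beam_search(start, steps, num_beams):
    """Return the best sequence and its probability, keeping num_beams paths per step.

    `steps` must not exceed the depth of the toy table (2).
    """
    beams = [(0.0, start)]  # (log-probability, tokens)
    for _ in range(steps):
        candidates = []
        for log_prob, tokens in beams:
            for token, prob in TOY_MODEL[tokens].items():
                candidates.append((log_prob + math.log(prob), tokens + (token,)))
        # Keep only the num_beams most probable partial sequences.
        beams = sorted(candidates, reverse=True)[:num_beams]
    best_log_prob, best_tokens = beams[0]
    return best_tokens, math.exp(best_log_prob)


def greedy_search(start, steps):
    """Greedy search is beam search with a single beam."""
    return beam_search(start, steps, num_beams=1)


print(greedy_search(("The",), steps=2))
print(beam_search(("The",), steps=2, num_beams=2))
```

In this table, greedy search commits to "nice" because it is the best first step, and ends with a sequence of probability 0.5 × 0.4 = 0.2. Beam search with two beams also keeps "dog" alive and finds "The dog has" with probability 0.4 × 0.9 = 0.36. This is why a larger num_beams can find a more likely sequence at the cost of extra computation.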

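There are two main approaches to this search:

  • Search: given the tokens generated so far, choose the next token from the most likely candidates.

    • Greedy search (default): choose the single most likely next token.

    • Beam search: explore several sequence paths in parallel; the num_beams parameter controls how many paths are kept.

  • Sampling: given the tokens generated so far, choose the next token by sampling from the predicted token distribution.

    • Top-K sampling: restrict sampling to the k most likely tokens.

    • Top-p sampling: restrict sampling to the most likely tokens whose probabilities sum to p.

The do_sample parameter switches between search and sampling. The choice of method affects both the quality and the diversity of the generated text, so pick it according to the needs of your task.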
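The two sampling restrictions can be illustrated on a single next-token distribution. The sketch below is a simplified stand-in, not the library's implementation: the distribution `TOY_DISTRIBUTION` and the function names are hypothetical, and a real model works on logits over a vocabulary of tens of thousands of tokens.

```python
import random

# Hypothetical next-token distribution.
TOY_DISTRIBUTION = {"the": 0.5, "a": 0.3, "flood": 0.15, "zebra": 0.05}


def top_k_filter(probs: dict, k: int) -> dict:
    """Keep the k most likely tokens and renormalize their probabilities."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


def top_p_filter(probs: dict, p: float) -> dict:
    """Keep the smallest set of most likely tokens whose cumulative probability reaches p."""
    kept, cumulative = [], 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


def sample_token(probs: dict, rng: random.Random) -> str:
    """Draw one token according to the given probabilities."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


rng = random.Random(0)
print(top_k_filter(TOY_DISTRIBUTION, k=2))
print(top_p_filter(TOY_DISTRIBUTION, p=0.9))
print(sample_token(top_p_filter(TOY_DISTRIBUTION, p=0.9), rng))
```

Both filters remove the unlikely tail ("zebra" here) before sampling, which makes sampled text behave more like greedy output. Smaller values of k or p make generation more conservative, and larger values make it more diverse.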

# We previously called the summarization pipeline using the default inference configuration.   
# This does greedy search.   
summarizer(xsum_sample["document"][0])      
# Output
[{'summary_text': 'the full cost of damage in Newton Stewart is still being assessed . many roads in peeblesshire remain badly affected by standing water . a flood alert remains in place across the'}]   
# We can instead do a beam search by specifying num_beams.   
# This takes longer to run, but it might find a better (more likely) sequence of text.   
summarizer(xsum_sample["document"][0], num_beams=10)      

# Output
[{'summary_text': 'the full cost of damage in Newton Stewart is still being assessed . many roads in peeblesshire remain badly affected by standing water . a flood alert remains in place across the'}]   
# Alternatively, we could use sampling.   
summarizer(xsum_sample["document"][0], do_sample=True)      

# Output
[{'summary_text': 'many businesses and householders were affected by flooding in Newton Stewart . the water breached a retaining wall, flooding many commercial properties . a flood alert remains in place across'}]   
# We can modify sampling to be more greedy by limiting sampling to the top_k or top_p most likely next tokens.   
summarizer(xsum_sample["document"][0], do_sample=True, top_k=10, top_p=0.8)      

# Output
[{'summary_text': 'the full cost of damage in Newton Stewart is still being assessed . many roads in peeblesshire remain badly affected by standing water . a flood alert remains in place across the'}]   

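Auto* loaders

We have seen the Hugging Face dataset and pipeline abstractions. A pipeline is a quick way to set up an LLM for a specific task, while the slightly lower-level model and tokenizer abstractions give more control over options. We briefly show how to use them, following these steps:

  • Start from the input articles.

  • Tokenize the articles, converting them to token indices.

  • Apply the model to the tokenized data to generate summaries (represented as token indices).

  • Decode the summaries into human-readable text.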
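When several articles are tokenized as one batch, they must be brought to the same length. The toy function below sketches what the `padding=True` and `truncation=True` options amount to, and where the attention mask comes from. The token ids are made up and `pad_batch` is a hypothetical helper. A real tokenizer also handles special tokens such as end-of-sequence, which this sketch ignores.

```python
def pad_batch(sequences, pad_id=0, max_length=None):
    """Right-pad token id lists to a common length and build the attention mask."""
    if max_length is not None:
        # Truncation: drop tokens beyond max_length.
        sequences = [seq[:max_length] for seq in sequences]
    width = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (width - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[21603, 10, 37], [21603, 10, 71, 1472, 6]])  # made-up token ids
print(batch["input_ids"])
print(batch["attention_mask"])
```

The `input_ids` and `attention_mask` tensors printed by the tokenizer code in this section have this same shape: one row per article, padded to the longest article in the batch or cut at `max_length`.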

Let's now look at the Auto* classes for tokenizers and model types, which simplify loading pre-trained tokenizers and models.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM       

# Load the pre-trained tokenizer and model.   
tokenizer = AutoTokenizer.from_pretrained("t5-small", cache_dir="/root/home/LLMs/week1")   
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small", cache_dir="/root/home/LLMs/week1")   
# For summarization, T5-small expects a prefix "summarize: ", so we prepend that to each article as a prompt.   
articles = list(map(lambda article: "summarize: " + article, xsum_sample["document"]))   
display(pd.DataFrame(articles, columns=["prompts"]))   
# Tokenize the input   
inputs = tokenizer(    
    articles, max_length=1024, return_tensors="pt", padding=True, truncation=True   
)   
print("input_ids:")   
print(inputs["input_ids"])   
print("attention_mask:")   
print(inputs["attention_mask"])      
# Generate summaries   
summary_ids = model.generate(     
    inputs.input_ids,       
    attention_mask=inputs.attention_mask,       
    num_beams=2,       
    min_length=0,       
    max_length=40,   
)   
print(summary_ids)   
# Decode the generated summaries   
decoded_summaries = tokenizer.batch_decode(summary_ids, skip_special_tokens=True)   
display(pd.DataFrame(decoded_summaries, columns=["decoded_summaries"]))   

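Model-specific tokenizers and model loaders

You can also load a specific tokenizer and model type explicitly, rather than relying on the Auto* classes to select the right type for you. This gives you more direct control over which classes are used.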

from transformers import T5Tokenizer, T5ForConditionalGeneration       

tokenizer = T5Tokenizer.from_pretrained("t5-small", cache_dir="/root/home/LLMs/week1")   
model = T5ForConditionalGeneration.from_pretrained(   
    "t5-small", cache_dir="/root/home/LLMs/week1"   
)   
# The tokenizer and model can then be used similarly to how we used the ones loaded by the Auto* classes.   
inputs = tokenizer(  
    articles, max_length=1024, return_tensors="pt", padding=True, truncation=True   
)   
summary_ids = model.generate(     
    inputs.input_ids,       
    attention_mask=inputs.attention_mask,       
    num_beams=2,       
    min_length=0,       
    max_length=40,   
)   
decoded_summaries = tokenizer.batch_decode(summary_ids, skip_special_tokens=True)       

display(pd.DataFrame(decoded_summaries, columns=["decoded_summaries"]))   

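Summary

In this article, we covered several common LLM applications and saw how to get started quickly with pre-trained models from the Hugging Face Hub. We also saw how to adjust some configuration options to better fit specific needs. With these basics, you can apply language models more flexibly and get better results.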


