Open Source Models with Hugging Face
These are my study notes for the course https://www.deeplearning.ai/short-courses/open-source-models-hugging-face/.
Table of Contents
- Open Source Models with Hugging Face
- What you’ll learn in this course
- Lesson 1: Selecting Models
- Lesson 2: Natural Language Processing (NLP)
- Build the `chatbot` pipeline using 🤗 Transformers Library
- Lesson 3: Translation and Summarization
- Build the `translation` pipeline using 🤗 Transformers Library
- Free up some memory before continuing
- Build the `summarization` pipeline using 🤗 Transformers Library
- Lesson 4: Sentence Embeddings
- Build the `sentence embedding` pipeline using 🤗 Transformers Library
- Lesson 5: Zero-Shot Audio Classification
- Prepare the dataset of audio recordings
- Build the `audio classification` pipeline using 🤗 Transformers Library
- Sampling Rate for Transformer Models
- Lesson 6: Automatic Speech Recognition
- Data preparation
- Build the pipeline
- Build a shareable app with Gradio
- Troubleshooting Tip
- Lesson 7: Text to Speech
- Build the `text-to-speech` pipeline using the 🤗 Transformers Library
- Lesson 8: Object Detection
- Build the `object-detection` pipeline using 🤗 Transformers Library
- Use the Pipeline
- Using `Gradio` as a Simple Interface
- Make an AI Powered Audio Assistant
- Generate Audio Narration of an Image
- Play the Generated Audio
- Lesson 9: Segmentation
- Mask Generation with SAM
- Faster Inference: Infer an Image and a Single Point
- Depth Estimation with DPT
- Demo using Gradio
- Lesson 10: Image Retrieval
- Test if the image matches the text
- Lesson 11: Image Captioning
- Conditional Image Captioning
- Unconditional Image Captioning
- Lesson 12: Visual Question Answering
- Lesson 13: Zero-Shot Image Classification
- Postscript
Open-source models on Hugging Face: NLP, speech recognition, object detection, multimodal tasks, and more.
What you’ll learn in this course
The availability of models and their weights for anyone to download enables a broader range of developers to innovate and create.
In this course, you’ll select open source models from Hugging Face Hub to perform NLP, audio, image and multimodal tasks using the Hugging Face transformers library. Easily package your code into a user-friendly app that you can run on the cloud using Gradio and Hugging Face Spaces.
You will:
- Use the transformers library to turn a small language model into a chatbot capable of multi-turn conversations to answer follow-up questions.
- Translate between languages, summarize documents, and measure the similarity between two pieces of text, which can be used for search and retrieval.
- Convert audio to text with Automatic Speech Recognition (ASR), and convert text to audio using Text to Speech (TTS).
- Perform zero-shot audio classification, to classify audio without fine-tuning the model.
- Generate an audio narration describing an image by combining object detection and text-to-speech models.
- Identify objects or regions in an image by prompting a zero-shot image segmentation model with points to identify the object that you want to select.
- Implement visual question answering, image search, image captioning and other multimodal tasks.
- Share your AI app using Gradio and Hugging Face Spaces to run your applications in a user-friendly interface on the cloud or as an API.
The course will provide you with the building blocks that you can combine into a pipeline to build your AI-enabled applications!
Lesson 1: Selecting Models
There is a wide variety of models to choose from on the Hugging Face Hub, all usable through the transformers library.
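- To browse the Hub programmatically, you can query it with the `huggingface_hub` client. This is a minimal sketch (not part of the course notebook); the task filter and limit are just example values.
from huggingface_hub import list_models
# List a few popular models for an example task, sorted by download count.
for model_info in list_models(task="automatic-speech-recognition",
                              sort="downloads",
                              direction=-1,
                              limit=5):
    print(model_info.id)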
Lesson 2: Natural Language Processing (NLP)
- In the classroom, the libraries are already installed for you.
- If you would like to run this code on your own machine, you can install the following:
!pip install transformers
Build the `chatbot` pipeline using 🤗 Transformers Library
- Here is some code that suppresses warning messages.
from transformers.utils import logging
logging.set_verbosity_error()
from transformers import pipeline
- Define the conversation pipeline
chatbot = pipeline(task="conversational",
model="./models/facebook/blenderbot-400M-distill")
user_message = """
What are some fun activities I can do in the winter?
"""
from transformers import Conversation
conversation = Conversation(user_message)
print(conversation)
conversation = chatbot(conversation)
print(conversation)
Output
Conversation id: 7a978de5-931d-4f62-8a19-77c997d93c7b
user:
What are some fun activities I can do in the winter?
assistant: I like snowboarding and skiing. What do you like to do in winter?
- You can continue the conversation with the chatbot with:
print(chatbot(Conversation("What else do you recommend?")))
- However, the chatbot may provide an unrelated response because it does not have memory of any prior conversations.
- To include prior conversations in the LLM’s context, you can add a ‘message’ to include the previous chat history.
conversation.add_message(
{"role": "user",
"content": """
What else do you recommend?
"""
})
print(conversation)
conversation = chatbot(conversation)
print(conversation)
Output
Conversation id: 7a978de5-931d-4f62-8a19-77c997d93c7b
user:
What are some fun activities I can do in the winter?
assistant: I like snowboarding and skiing. What do you like to do in winter?
user:
What else do you recommend?
assistant: Snowboarding is a lot of fun. You can do it indoors or outdoors.
- Open LLM Leaderboard
- LMSYS Chatbot Arena Leaderboard
Lesson 3: Translation and Summarization
- In the classroom, the libraries are already installed for you.
- If you would like to run this code on your own machine, you can install the following:
!pip install transformers
!pip install torch
- Here is some code that suppresses warning messages.
from transformers.utils import logging
logging.set_verbosity_error()
Build the `translation` pipeline using 🤗 Transformers Library
from transformers import pipeline
import torch
translator = pipeline(task="translation",
model="./models/facebook/nllb-200-distilled-600M",
torch_dtype=torch.bfloat16)
text = """\
My puppy is adorable, \
Your kitten is cute.
Her panda is friendly.
His llama is thoughtful. \
We all have nice pets!"""
text_translated = translator(text,
src_lang="eng_Latn",
tgt_lang="zho_Hans")
To choose other languages, you can find the other language codes on the page: Languages in FLORES-200
For example:
- Afrikaans: afr_Latn
- Chinese: zho_Hans
- Egyptian Arabic: arz_Arab
- French: fra_Latn
- German: deu_Latn
- Greek: ell_Grek
- Hindi: hin_Deva
- Indonesian: ind_Latn
- Italian: ita_Latn
- Japanese: jpn_Jpan
- Korean: kor_Hang
- Persian: pes_Arab
- Portuguese: por_Latn
- Russian: rus_Cyrl
- Spanish: spa_Latn
- Swahili: swh_Latn
- Thai: tha_Thai
- Turkish: tur_Latn
- Vietnamese: vie_Latn
- Zulu: zul_Latn
text_translated
Output
[{'translation_text': '我的狗很可爱,你的小猫很可爱,她的熊猫很友好,他的拉马很有心情.我们都有好物!'}]
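- For example, to translate the same text into French instead, reuse the pipeline with a different target code from the list above (a usage sketch, not from the notebook):
text_translated_fr = translator(text,
                                src_lang="eng_Latn",
                                tgt_lang="fra_Latn")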
Free up some memory before continuing
- In order to have enough free memory to run the rest of the code, please run the following to free up memory on the machine.
import gc
del translator
gc.collect()
Build the `summarization` pipeline using 🤗 Transformers Library
summarizer = pipeline(task="summarization",
model="./models/facebook/bart-large-cnn",
torch_dtype=torch.bfloat16)
text = """Paris is the capital and most populous city of France, with
an estimated population of 2,175,601 residents as of 2018,
in an area of more than 105 square kilometres (41 square
miles). The City of Paris is the centre and seat of
government of the region and province of Île-de-France, or
Paris Region, which has an estimated population of
12,174,880, or about 18 percent of the population of France
as of 2017."""
summary = summarizer(text,
min_length=10,
max_length=100)
summary
Output
[{'summary_text': 'Paris is the capital and most populous city of France, with an estimated population of 2,175,601 residents as of 2018. The City of Paris is the centre and seat of the government of the region and province of Île-de-France.'}]
Lesson 4: Sentence Embeddings
- In the classroom, the libraries are already installed for you.
- If you would like to run this code on your own machine, you can install the following:
!pip install sentence-transformers
Build the `sentence embedding` pipeline using 🤗 Transformers Library
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
sentences1 = ['The cat sits outside',
'A man is playing guitar',
'The movies are awesome']
embeddings1 = model.encode(sentences1, convert_to_tensor=True)
embeddings1
print(embeddings1.shape)
Output
tensor([[ 0.1392, 0.0030, 0.0470, ..., 0.0641, -0.0163, 0.0636],
[ 0.0227, -0.0014, -0.0056, ..., -0.0225, 0.0846, -0.0283],
[-0.1043, -0.0628, 0.0093, ..., 0.0020, 0.0653, -0.0150]])
torch.Size([3, 384])
sentences2 = ['The dog plays in the garden',
'A woman watches TV',
'The new movie is so great']
embeddings2 = model.encode(sentences2,
convert_to_tensor=True)
print(embeddings2)
Output
tensor([[ 0.0163, -0.0700, 0.0384, ..., 0.0447, 0.0254, -0.0023],
[ 0.0054, -0.0920, 0.0140, ..., 0.0167, -0.0086, -0.0424],
[-0.0842, -0.0592, -0.0010, ..., -0.0157, 0.0764, 0.0389]])
- Calculate the cosine similarity between two sentences as a measure of how similar they are to each other.
from sentence_transformers import util
cosine_scores = util.cos_sim(embeddings1,embeddings2)
print(cosine_scores)
Output
tensor([[ 0.2838, 0.1310, -0.0029],
[ 0.2277, -0.0327, -0.0136],
[-0.0124, -0.0465, 0.6571]])
Each value in this result is the cosine similarity score between a pair of sentences. Given two sentence embedding vectors, cosine similarity measures how similar they are, with values between -1 and 1: 1 means maximally similar, -1 means maximally dissimilar, and 0 means no similarity.
Here, cosine_scores is a tensor of shape (3, 3) because there are 3 sentences in embeddings1 and 3 sentences in embeddings2. Each row of the result therefore contains the cosine similarities between one sentence from embeddings1 and all sentences from embeddings2.
The values correspond to the following sentence pairs:
- Row 1: cosine similarities between sentences1[0] (“The cat sits outside”) and each sentence in sentences2.
- Row 2: cosine similarities between sentences1[1] (“A man is playing guitar”) and each sentence in sentences2.
- Row 3: cosine similarities between sentences1[2] (“The movies are awesome”) and each sentence in sentences2.
For example, the first value, 0.2838, is the cosine similarity between the first sentence in sentences1 and the first sentence in sentences2.
for i in range(len(sentences1)):
print("{} \t\t {} \t\t Score: {:.4f}".format(sentences1[i],
sentences2[i],
cosine_scores[i][i]))
Output
The cat sits outside The dog plays in the garden Score: 0.2838
A man is playing guitar A woman watches TV Score: -0.0327
The movies are awesome The new movie is so great Score: 0.6571
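- The loop above only compares the i-th sentence of each list. The full 3×3 score matrix can also be used for simple retrieval; here is a small sketch (not from the notebook) that finds the closest sentence in sentences2 for each sentence in sentences1.
import torch
# For each sentence in sentences1, pick the most similar sentence in sentences2.
best_match = torch.argmax(cosine_scores, dim=1)
for i, j in enumerate(best_match.tolist()):
    print(f"{sentences1[i]} -> {sentences2[j]} (score: {cosine_scores[i][j]:.4f})")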
Lesson 5: Zero-Shot Audio Classification
- In the classroom, the libraries have already been installed for you.
- If you are running this code on your own machine, please install the following:
!pip install transformers
!pip install datasets
!pip install soundfile
!pip install librosa
The `librosa` library may need to have ffmpeg installed.
- This page on librosa provides installation instructions for ffmpeg.
Prepare the dataset of audio recordings
from datasets import load_dataset, load_from_disk
# This dataset is a collection of different sounds of 5 seconds
# dataset = load_dataset("ashraq/esc50",
# split="train[0:10]")
dataset = load_from_disk("./models/ashraq/esc50/train")
audio_sample = dataset[0]
audio_sample
Output
{'filename': '1-100032-A-0.wav',
'fold': 1,
'target': 0,
'category': 'dog',
'esc10': True,
'src_file': 100032,
'take': 'A',
'audio': {'path': None,
'array': array([0., 0., 0., ..., 0., 0., 0.]),
'sampling_rate': 44100}}
from IPython.display import Audio as IPythonAudio
IPythonAudio(audio_sample["audio"]["array"],
rate=audio_sample["audio"]["sampling_rate"])
Build the `audio classification` pipeline using 🤗 Transformers Library
from transformers import pipeline
zero_shot_classifier = pipeline(
task="zero-shot-audio-classification",
model="./models/laion/clap-htsat-unfused")
Sampling Rate for Transformer Models
To a Transformer model trained on 16 kHz audio, an array of 960,000 values looks like a 60-second recording at 16 kHz. This comes down to the relationship between an audio signal’s sampling rate and its duration.
In digital signal processing, the sampling rate is the number of samples taken per second. A sampling rate of 16 kHz means the signal is sampled 16,000 times per second.
An array of 960,000 values therefore corresponds to a 60-second recording, because 960,000 divided by 16,000 equals 60. Each value in the array is the amplitude of the audio signal at one sample point, and processing these values lets you analyze and reconstruct the original signal.
- How long does 1 second of high resolution audio (192,000 Hz) appear to the Whisper model (which is trained to expect audio files at 16,000 Hz)?
(1 * 192000) / 16000 = 12
- The 1 second of high resolution audio appears to the model as if it is 12 seconds of audio.
- How about 5 seconds of audio?
(5 * 192000) / 16000 = 60
- 5 seconds of high resolution audio appears to the model as if it is 60 seconds of audio.
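- The same arithmetic as a small Python snippet (illustrative only, not from the notebook):
recording_sampling_rate = 192_000   # high-resolution audio
model_sampling_rate = 16_000        # what the model was trained on
seconds_recorded = 5
# Number of samples in the recording, and how long that looks to the model.
num_samples = seconds_recorded * recording_sampling_rate
apparent_duration = num_samples / model_sampling_rate
print(apparent_duration)            # 60.0 seconds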
zero_shot_classifier.feature_extractor.sampling_rate # 48000
audio_sample["audio"]["sampling_rate"] # 44100
- Set the correct sampling rate for the input and the model.
from datasets import Audio
dataset = dataset.cast_column(
"audio",
Audio(sampling_rate=48_000))
audio_sample = dataset[0]
audio_sample
Output
{'filename': '1-100032-A-0.wav',
'fold': 1,
'target': 0,
'category': 'dog',
'esc10': True,
'src_file': 100032,
'take': 'A',
'audio': {'path': None,
'array': array([0., 0., 0., ..., 0., 0., 0.]),
'sampling_rate': 48000}}
candidate_labels = ["Sound of a dog",
"Sound of vacuum cleaner"]
zero_shot_classifier(audio_sample["audio"]["array"],
candidate_labels=candidate_labels)
Output
[{'score': 0.9985589385032654, 'label': 'Sound of a dog'},
{'score': 0.0014411062002182007, 'label': 'Sound of vacuum cleaner'}]
candidate_labels = ["Sound of a child crying",
"Sound of vacuum cleaner",
"Sound of a bird singing",
"Sound of an airplane"]
zero_shot_classifier(audio_sample["audio"]["array"],
candidate_labels=candidate_labels)
Output
[{'score': 0.6172538995742798, 'label': 'Sound of a bird singing'},
{'score': 0.21602486073970795, 'label': 'Sound of vacuum cleaner'},
{'score': 0.1254722625017166, 'label': 'Sound of an airplane'},
{'score': 0.041249003261327744, 'label': 'Sound of a child crying'}]
Lesson 6: Automatic Speech Recognition
Data preparation
from datasets import load_dataset
dataset = load_dataset("librispeech_asr",
split="train.clean.100",
streaming=True,
trust_remote_code=True)
example = next(iter(dataset))
dataset_head = dataset.take(5)
list(dataset_head)
Output
[{'file': '374-180298-0000.flac',
'audio': {'path': '374-180298-0000.flac',
'array': array([ 7.01904297e-04, 7.32421875e-04, 7.32421875e-04, ...,
-2.74658203e-04, -1.83105469e-04, -3.05175781e-05]),
'sampling_rate': 16000},
'text': 'CHAPTER SIXTEEN I MIGHT HAVE TOLD YOU OF THE BEGINNING OF THIS LIAISON IN A FEW LINES BUT I WANTED YOU TO SEE EVERY STEP BY WHICH WE CAME I TO AGREE TO WHATEVER MARGUERITE WISHED',
'speaker_id': 374,
'chapter_id': 180298,
'id': '374-180298-0000'},
{'file': '374-180298-0001.flac',
'audio': {'path': '374-180298-0001.flac',
'array': array([-9.15527344e-05, -1.52587891e-04, -1.52587891e-04, ...,
-2.13623047e-04, -1.83105469e-04, -2.74658203e-04]),
'sampling_rate': 16000},
'text': "MARGUERITE TO BE UNABLE TO LIVE APART FROM ME IT WAS THE DAY AFTER THE EVENING WHEN SHE CAME TO SEE ME THAT I SENT HER MANON LESCAUT FROM THAT TIME SEEING THAT I COULD NOT CHANGE MY MISTRESS'S LIFE I CHANGED MY OWN",
'speaker_id': 374,
'chapter_id': 180298,
'id': '374-180298-0001'},
{'file': '374-180298-0002.flac',
'audio': {'path': '374-180298-0002.flac',
'array': array([-2.44140625e-04, -2.44140625e-04, -1.83105469e-04, ...,
1.83105469e-04, 3.05175781e-05, -1.52587891e-04]),
'sampling_rate': 16000},
'text': 'I WISHED ABOVE ALL NOT TO LEAVE MYSELF TIME TO THINK OVER THE POSITION I HAD ACCEPTED FOR IN SPITE OF MYSELF IT WAS A GREAT DISTRESS TO ME THUS MY LIFE GENERALLY SO CALM',
'speaker_id': 374,
'chapter_id': 180298,
'id': '374-180298-0002'},
{'file': '374-180298-0003.flac',
'audio': {'path': '374-180298-0003.flac',
'array': array([-0.00024414, -0.00039673, -0.00057983, ..., 0.00018311,
0.00024414, 0.00024414]),
'sampling_rate': 16000},
'text': 'ASSUMED ALL AT ONCE AN APPEARANCE OF NOISE AND DISORDER NEVER BELIEVE HOWEVER DISINTERESTED THE LOVE OF A KEPT WOMAN MAY BE THAT IT WILL COST ONE NOTHING',
'speaker_id': 374,
'chapter_id': 180298,
'id': '374-180298-0003'},
{'file': '374-180298-0004.flac',
'audio': {'path': '374-180298-0004.flac',
'array': array([0.00027466, 0.00030518, 0.00021362, ..., 0.00015259, 0.00015259,
0.00015259]),
'sampling_rate': 16000},
'text': "NOTHING IS SO EXPENSIVE AS THEIR CAPRICES FLOWERS BOXES AT THE THEATRE SUPPERS DAYS IN THE COUNTRY WHICH ONE CAN NEVER REFUSE TO ONE'S MISTRESS AS I HAVE TOLD YOU I HAD LITTLE MONEY",
'speaker_id': 374,
'chapter_id': 180298,
'id': '374-180298-0004'}]
list(dataset_head)[2]
Output
{'file': '374-180298-0002.flac',
'audio': {'path': '374-180298-0002.flac',
'array': array([-2.44140625e-04, -2.44140625e-04, -1.83105469e-04, ...,
1.83105469e-04, 3.05175781e-05, -1.52587891e-04]),
'sampling_rate': 16000},
'text': 'I WISHED ABOVE ALL NOT TO LEAVE MYSELF TIME TO THINK OVER THE POSITION I HAD ACCEPTED FOR IN SPITE OF MYSELF IT WAS A GREAT DISTRESS TO ME THUS MY LIFE GENERALLY SO CALM',
'speaker_id': 374,
'chapter_id': 180298,
'id': '374-180298-0002'}
example
Output
{'file': '374-180298-0000.flac',
'audio': {'path': '374-180298-0000.flac',
'array': array([ 7.01904297e-04, 7.32421875e-04, 7.32421875e-04, ...,
-2.74658203e-04, -1.83105469e-04, -3.05175781e-05]),
'sampling_rate': 16000},
'text': 'CHAPTER SIXTEEN I MIGHT HAVE TOLD YOU OF THE BEGINNING OF THIS LIAISON IN A FEW LINES BUT I WANTED YOU TO SEE EVERY STEP BY WHICH WE CAME I TO AGREE TO WHATEVER MARGUERITE WISHED',
'speaker_id': 374,
'chapter_id': 180298,
'id': '374-180298-0000'}
from IPython.display import Audio as IPythonAudio
IPythonAudio(example["audio"]["array"],
rate=example["audio"]["sampling_rate"])
Build the pipeline
from transformers import pipeline
asr = pipeline(task="automatic-speech-recognition",
model="./models/distil-whisper/distil-small.en")
asr.feature_extractor.sampling_rate # 16000
example['audio']['sampling_rate'] # 16000
asr(example["audio"]["array"])
Output
{'text': ' Chapter 16 I might have told you of the beginning of this liaison in a few lines, but I wanted you to see every step by which we came. I too agree to whatever Marguerite wished.'}
example["text"]
Output
'CHAPTER SIXTEEN I MIGHT HAVE TOLD YOU OF THE BEGINNING OF THIS LIAISON IN A FEW LINES BUT I WANTED YOU TO SEE EVERY STEP BY WHICH WE CAME I TO AGREE TO WHATEVER MARGUERITE WISHED'
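- To quantify how close the transcription above is to the reference text, one option (not part of the course notebook) is to compute the word error rate with the 🤗 evaluate library; the normalization below is deliberately rough.
import evaluate
wer_metric = evaluate.load("wer")
# Roughly normalize the prediction to the reference style (uppercase, no punctuation).
prediction = asr(example["audio"]["array"])["text"]
prediction = prediction.upper().replace(",", "").replace(".", "").strip()
wer = wer_metric.compute(predictions=[prediction],
                         references=[example["text"]])
print(f"Word error rate: {wer:.3f}")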
Build a shareable app with Gradio
Troubleshooting Tip
- Note, in the classroom, you may see the code for creating the Gradio app run indefinitely.
- This is specific to this classroom environment when it’s serving many learners at once, and you wouldn’t experience this issue if you run this code on your own machine.
- To fix this, please restart the kernel (Menu Kernel->Restart Kernel) and re-run the code in the lab from the beginning of the lesson.
import os
import gradio as gr
demo = gr.Blocks()
def transcribe_speech(filepath):
if filepath is None:
gr.Warning("No audio found, please retry.")
return ""
output = asr(filepath)
return output["text"]
mic_transcribe = gr.Interface(
fn=transcribe_speech,
inputs=gr.Audio(sources="microphone",
type="filepath"),
outputs=gr.Textbox(label="Transcription",
lines=3),
allow_flagging="never")
To learn more about building apps with Gradio, you can check out the short course: Building Generative AI Applications with Gradio, also taught by Hugging Face.
file_transcribe = gr.Interface(
fn=transcribe_speech,
inputs=gr.Audio(sources="upload",
type="filepath"),
outputs=gr.Textbox(label="Transcription",
lines=3),
allow_flagging="never",
)
with demo:
gr.TabbedInterface(
[mic_transcribe,
file_transcribe],
["Transcribe Microphone",
"Transcribe Audio File"],
)
demo.launch(share=True,
server_port=int(os.environ['PORT1']))
Output
demo.close()
Output
Closing server running on port: 45227
- Convert the audio from stereo to mono (Using librosa)
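- The `audio` array and `sampling_rate` used below are assumed to have been loaded earlier in the notebook; one way to load a stereo file is with soundfile (the filename here is hypothetical).
import soundfile as sf
# Hypothetical filename; the course notebook uses its own example recording.
audio, sampling_rate = sf.read("example_stereo_recording.wav")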
audio.shape # (3572352, 2)
import numpy as np
audio_transposed = np.transpose(audio)
audio_transposed.shape # (2, 3572352)
import librosa
audio_mono = librosa.to_mono(audio_transposed)
IPythonAudio(audio_mono,
rate=sampling_rate)
asr(audio_mono)
Output
{'text': " I'm sorry. or two. The problem is, is that you want to Yeah. You. Yeah. A the next week. The world is It's true. The war has become a huge and broad-world size. The war is there. What I can hear. The war is there. The song is, It's a Little Shoo's Talking. And then, How How How How? I'm sorry The story is about the record watching the watch again. The B."}
Warning: The cell above might throw a warning because the sample rate of the audio sample is not the same as the sample rate of the model.
Let’s check and fix this!
sampling_rate # 44100
asr.feature_extractor.sampling_rate # 16000
audio_16KHz = librosa.resample(audio_mono,
orig_sr=sampling_rate,
target_sr=16000)
asr(
audio_16KHz,
chunk_length_s=30, # 30 seconds
batch_size=4,
return_timestamps=True,
)["chunks"]
Output
[{'timestamp': (0.0, 13.0),
'text': ' They run. They laugh. I see the glow shining on their eyes. Not like hers. She seems distant, strange, somehow cold.'},
{'timestamp': (13.0, 27.0),
'text': " A couple of days after, I receive the call. I curse, scream, and cry. They're gone. I drink and cry and dream over and over. Why?"},
{'timestamp': (27.0, 33.0),
'text': ' Time drags me, expending days, months, or maybe years.'},
{'timestamp': (33.0, 39.0),
'text': ' But the pain still remains. It grows. It changes me.'},
{'timestamp': (39.0, 43.0),
'text': ' Someone tells me she got released from the psychiatric ward.'},
{'timestamp': (43.0, 46.08),
'text': ' 426 days after. I got confused. the psychiatric ward. 426 days after.'},
{'timestamp': (46.08, 47.08), 'text': ' My head spins.'},
{'timestamp': (47.08, 49.4), 'text': ' I got confused.'},
{'timestamp': (49.4, 51.08), 'text': ' The loneliness.'},
{'timestamp': (51.08, 52.56), 'text': " It's time."},
{'timestamp': (52.56, 55.04), 'text': ' The road has become endless.'},
{'timestamp': (55.04, 57.4), 'text': ' I feel the cold wind on my face.'},
{'timestamp': (57.4, 59.52), 'text': ' My eyes burn.'},
{'timestamp': (59.52, 61.08), 'text': ' I get to the house.'},
{'timestamp': (61.08, 62.76), 'text': ' It all looks the same.'},
{'timestamp': (62.76, 66.84),
'text': ' I can hear them, laughing like there were no souls taken.'},
{'timestamp': (66.84, 73.36),
'text': ' And then she comes. She sees me with kindness in her eyes. She looks at the flowers and'},
{'timestamp': (73.36, 80.44),
'text': ' she says she still loves me. Those words hurt me like a razor blade. Good bye, my love.'}]
- Build the Gradio interface.
import gradio as gr
demo = gr.Blocks()
def transcribe_long_form(filepath):
if filepath is None:
gr.Warning("No audio found, please retry.")
return ""
output = asr(
filepath,
max_new_tokens=256,
chunk_length_s=30,
batch_size=8,
)
return output["text"]
mic_transcribe = gr.Interface(
fn=transcribe_long_form,
inputs=gr.Audio(sources="microphone",
type="filepath"),
outputs=gr.Textbox(label="Transcription",
lines=3),
allow_flagging="never")
file_transcribe = gr.Interface(
fn=transcribe_long_form,
inputs=gr.Audio(sources="upload",
type="filepath"),
outputs=gr.Textbox(label="Transcription",
lines=3),
allow_flagging="never",
)
with demo:
gr.TabbedInterface(
[mic_transcribe,
file_transcribe],
["Transcribe Microphone",
"Transcribe Audio File"],
)
demo.launch(share=True,
server_port=int(os.environ['PORT1']))
Output
Lesson 7: Text to Speech
- In the classroom, the libraries are already installed for you.
- If you would like to run this code on your own machine, you can install the following:
!pip install transformers
!pip install gradio
!pip install timm
!pip install inflect
!pip install phonemizer
Note: `py-espeak-ng` is only available on Linux operating systems.
To run locally in a Linux machine, follow these commands:
sudo apt-get update
sudo apt-get install espeak-ng
pip install py-espeak-ng
Build the `text-to-speech` pipeline using the 🤗 Transformers Library
from transformers import pipeline
narrator = pipeline("text-to-speech",
model="./models/kakao-enterprise/vits-ljs")
text = """
Researchers at the Allen Institute for AI, \
HuggingFace, Microsoft, the University of Washington, \
Carnegie Mellon University, and the Hebrew University of \
Jerusalem developed a tool that measures atmospheric \
carbon emitted by cloud servers while training machine \
learning models. After a model’s size, the biggest variables \
were the server’s location and time of day it was active.
"""
narrated_text = narrator(text)
from IPython.display import Audio as IPythonAudio
IPythonAudio(narrated_text["audio"][0],
rate=narrated_text["sampling_rate"])
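- If you want to keep the narration, one option (not shown in the course) is to write it to a WAV file with soundfile:
import soundfile as sf
# narrated_text["audio"] has shape (1, num_samples); take the first (and only) channel.
sf.write("narration.wav",
         narrated_text["audio"][0],
         samplerate=narrated_text["sampling_rate"])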
Lesson 8: Object Detection
- In the classroom, the libraries are already installed for you.
- If you would like to run this code on your own machine, you can install the following:
!pip install transformers
!pip install gradio
!pip install timm
!pip install inflect
!pip install phonemizer
Build the `object-detection` pipeline using 🤗 Transformers Library
- This model was released with the paper End-to-End Object Detection with Transformers by Carion et al. (2020).
from helper import load_image_from_url, render_results_in_image
from transformers import pipeline
from transformers.utils import logging
logging.set_verbosity_error()
from helper import ignore_warnings
ignore_warnings()
od_pipe = pipeline("object-detection", "./models/facebook/detr-resnet-50")
Use the Pipeline
from PIL import Image
raw_image = Image.open('huggingface_friends.jpg')
raw_image.resize((569, 491))
pipeline_output = od_pipe(raw_image)
- Return the results from the pipeline using the helper function `render_results_in_image`.
processed_image = render_results_in_image(
raw_image,
pipeline_output)
processed_image
Output
For reference, here is the implementation of the image-rendering helper function:
import io
import matplotlib.pyplot as plt
from PIL import Image

def render_results_in_image(in_pil_img, in_results):
plt.figure(figsize=(16, 10))
plt.imshow(in_pil_img)
ax = plt.gca()
for prediction in in_results:
x, y = prediction['box']['xmin'], prediction['box']['ymin']
w = prediction['box']['xmax'] - prediction['box']['xmin']
h = prediction['box']['ymax'] - prediction['box']['ymin']
ax.add_patch(plt.Rectangle((x, y),
w,
h,
fill=False,
color="green",
linewidth=2))
ax.text(
x,
y,
f"{prediction['label']}: {round(prediction['score']*100, 1)}%",
color='red'
)
plt.axis("off")
# Save the modified image to a BytesIO object
img_buf = io.BytesIO()
plt.savefig(img_buf, format='png',
bbox_inches='tight',
pad_inches=0)
img_buf.seek(0)
modified_image = Image.open(img_buf)
# Close the plot to prevent it from being displayed
plt.close()
return modified_image
Using `Gradio` as a Simple Interface
- Use Gradio to create a demo for the object detection app.
- The demo makes it look friendly and easy to use.
- You can share the demo with your friends and colleagues as well.
import os
import gradio as gr
def get_pipeline_prediction(pil_image):
pipeline_output = od_pipe(pil_image)
processed_image = render_results_in_image(pil_image,
pipeline_output)
return processed_image
demo = gr.Interface(
fn=get_pipeline_prediction,
inputs=gr.Image(label="Input image",
type="pil"),
outputs=gr.Image(label="Output image with predicted instances",
type="pil")
)
`share=True` will provide an online link to access the demo.
demo.launch(share=True, server_port=int(os.environ['PORT1']))
Output
Make an AI Powered Audio Assistant
- Combine the object detector with a text-to-speech model that will help dictate what is inside the image.
- Inspect the output of the object detection pipeline.
pipeline_output
Output
[{'score': 0.9856818318367004,
'label': 'fork',
'box': {'xmin': 808, 'ymin': 688, 'xmax': 836, 'ymax': 765}},
{'score': 0.9904232025146484,
'label': 'bottle',
'box': {'xmin': 688, 'ymin': 667, 'xmax': 743, 'ymax': 789}},
{'score': 0.9948464632034302,
'label': 'cup',
'box': {'xmin': 520, 'ymin': 770, 'xmax': 577, 'ymax': 863}},
{'score': 0.9971936941146851,
'label': 'person',
'box': {'xmin': 778, 'ymin': 387, 'xmax': 1125, 'ymax': 972}},
{'score': 0.9695369005203247,
'label': 'bottle',
'box': {'xmin': 465, 'ymin': 786, 'xmax': 527, 'ymax': 912}},
{'score': 0.9300816059112549,
'label': 'bowl',
'box': {'xmin': 556, 'ymin': 739, 'xmax': 622, 'ymax': 779}},
{'score': 0.9995697140693665,
'label': 'person',
'box': {'xmin': 231, 'ymin': 286, 'xmax': 510, 'ymax': 783}},
{'score': 0.9992026686668396,
'label': 'person',
'box': {'xmin': 0, 'ymin': 338, 'xmax': 349, 'ymax': 974}},
{'score': 0.9742276668548584,
'label': 'dining table',
'box': {'xmin': 167, 'ymin': 712, 'xmax': 873, 'ymax': 971}},
{'score': 0.9756981730461121,
'label': 'fork',
'box': {'xmin': 243, 'ymin': 682, 'xmax': 298, 'ymax': 802}},
{'score': 0.9946128129959106,
'label': 'bottle',
'box': {'xmin': 497, 'ymin': 681, 'xmax': 558, 'ymax': 824}},
{'score': 0.9976205229759216,
'label': 'cup',
'box': {'xmin': 610, 'ymin': 715, 'xmax': 669, 'ymax': 814}},
{'score': 0.993443489074707,
'label': 'person',
'box': {'xmin': 799, 'ymin': 469, 'xmax': 1049, 'ymax': 821}}]
raw_image = Image.open('huggingface_friends.jpg')
raw_image.resize((284, 245))
from helper import summarize_predictions_natural_language
text = summarize_predictions_natural_language(pipeline_output)
text
Output
'In this image, there are two forks three bottles two cups four persons one bowl and one dining table.'
The implementation of `summarize_predictions_natural_language` is as follows:
import inflect

def summarize_predictions_natural_language(predictions):
summary = {}
p = inflect.engine()
for prediction in predictions:
label = prediction['label']
if label in summary:
summary[label] += 1
else:
summary[label] = 1
result_string = "In this image, there are "
for i, (label, count) in enumerate(summary.items()):
count_string = p.number_to_words(count)
result_string += f"{count_string} {label}"
if count > 1:
result_string += "s"
result_string += " "
if i == len(summary) - 2:
result_string += "and "
# Remove the trailing comma and space
result_string = result_string.rstrip(', ') + "."
return result_string
Generate Audio Narration of an Image
tts_pipe = pipeline("text-to-speech",
model="./models/kakao-enterprise/vits-ljs")
narrated_text = tts_pipe(text)
Play the Generated Audio
from IPython.display import Audio as IPythonAudio
IPythonAudio(narrated_text["audio"][0],
rate=narrated_text["sampling_rate"])
Lesson 9: Segmentation
- In the classroom, the libraries are already installed for you.
- If you would like to run this code on your own machine, you can install the following:
!pip install transformers
!pip install gradio
!pip install timm
!pip install torchvision
Mask Generation with SAM
The Segment Anything Model (SAM) model was released by Meta AI.
from transformers import pipeline
sam_pipe = pipeline("mask-generation",
"./models/Zigeng/SlimSAM-uniform-77")
from PIL import Image
raw_image = Image.open('meta_llamas.jpg')
raw_image.resize((720, 375))
Output
- Running this will take some time
- The higher the value of ‘points_per_batch’, the more efficient pipeline inference will be
output = sam_pipe(raw_image, points_per_batch=32)
from helper import show_pipe_masks_on_image
show_pipe_masks_on_image(raw_image, output)
Output
Faster Inference: Infer an Image and a Single Point
from transformers import SamModel, SamProcessor
model = SamModel.from_pretrained(
"./models/Zigeng/SlimSAM-uniform-77")
processor = SamProcessor.from_pretrained(
"./models/Zigeng/SlimSAM-uniform-77")
raw_image.resize((720, 375))
- Segment the blue shirt Andrew is wearing.
- Give any single 2D point that would be in that region (blue shirt).
input_points = [[[1600, 700]]]
- Create the input using the image and the single point.
`return_tensors="pt"` means to return PyTorch tensors.
inputs = processor(raw_image,
input_points=input_points,
return_tensors="pt")
import torch
with torch.no_grad():
outputs = model(**inputs)
predicted_masks = processor.image_processor.post_process_masks(
outputs.pred_masks,
inputs["original_sizes"],
inputs["reshaped_input_sizes"]
)
The length of `predicted_masks` corresponds to the number of images that are used in the input.
len(predicted_masks)
- Inspect the size of the first ([0]) predicted mask
predicted_mask = predicted_masks[0]
predicted_mask.shape # torch.Size([1, 3, 1500, 2880])
outputs.iou_scores # tensor([[[0.9583, 0.9551, 0.9580]]])
from helper import show_mask_on_image
for i in range(3):
show_mask_on_image(raw_image, predicted_mask[:, i])
Output
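- Each of the three candidate masks comes with a predicted IoU score (see `outputs.iou_scores` above). To keep only one mask per point, a simple heuristic (not from the notebook) is to pick the highest-scoring candidate:
# Pick the candidate mask with the highest predicted IoU score.
best_idx = outputs.iou_scores[0, 0].argmax().item()
show_mask_on_image(raw_image, predicted_mask[:, best_idx])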
Depth Estimation with DPT
- This model was introduced in the paper Vision Transformers for Dense Prediction by Ranftl et al. (2021) and first released in isl-org/DPT.
depth_estimator = pipeline(task="depth-estimation",
model="./models/Intel/dpt-hybrid-midas")
raw_image = Image.open('gradio_tamagochi_vienna.png')
raw_image.resize((806, 621))
Output
output = depth_estimator(raw_image)
- Post-process the output image to resize it to the size of the original image.
output["predicted_depth"].shape # torch.Size([1, 384, 384])
output["predicted_depth"].unsqueeze(1).shape # torch.Size([1, 1, 384, 384])
prediction = torch.nn.functional.interpolate(
output["predicted_depth"].unsqueeze(1),
size=raw_image.size[::-1],
mode="bicubic",
align_corners=False,
)
prediction.shape # torch.Size([1, 1, 1242, 1612])
raw_image.size[::-1], # ((1242, 1612),)
prediction
import numpy as np
output = prediction.squeeze().numpy()
formatted = (output * 255 / np.max(output)).astype("uint8")
depth = Image.fromarray(formatted)
depth
Output
Demo using Gradio
Troubleshooting Tip
- Note, in the classroom, you may see the code for creating the Gradio app run indefinitely.
- This is specific to this classroom environment when it’s serving many learners at once, and you wouldn’t experience this issue if you run this code on your own machine.
- To fix this, please restart the kernel (Menu Kernel->Restart Kernel) and re-run the code in the lab from the beginning of the lesson.
import os
import gradio as gr
from transformers import pipeline
def launch(input_image):
out = depth_estimator(input_image)
# resize the prediction
prediction = torch.nn.functional.interpolate(
out["predicted_depth"].unsqueeze(1),
size=input_image.size[::-1],
mode="bicubic",
align_corners=False,
)
# normalize the prediction
output = prediction.squeeze().numpy()
formatted = (output * 255 / np.max(output)).astype("uint8")
depth = Image.fromarray(formatted)
return depth
iface = gr.Interface(launch,
inputs=gr.Image(type='pil'),
outputs=gr.Image(type='pil'))
iface.launch(share=True, server_port=int(os.environ['PORT1']))
Lesson 10: Image Retrieval
- In the classroom, the libraries are already installed for you.
- If you would like to run this code on your own machine, you can install the following:
!pip install transformers
!pip install torch
- Load the model and the processor
from transformers import BlipForImageTextRetrieval
model = BlipForImageTextRetrieval.from_pretrained(
"./models/Salesforce/blip-itm-base-coco")
from transformers import AutoProcessor
processor = AutoProcessor.from_pretrained(
"./models/Salesforce/blip-itm-base-coco")
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
from PIL import Image
import requests
raw_image = Image.open(
requests.get(img_url, stream=True).raw).convert('RGB')
Output
Test if the image matches the text
text = "an image of a woman and a dog on the beach"
inputs = processor(images=raw_image,
text=text,
return_tensors="pt")
itm_scores = model(**inputs)[0]
itm_scores # tensor([[-2.2228, 2.2260]], grad_fn=<AddmmBackward0>)
- Use a softmax layer to get the probabilities
import torch
itm_score = torch.nn.functional.softmax(
itm_scores,dim=1)
itm_score
print(f"""\
The image and text are matched \
with a probability of {itm_score[0][1]:.4f}""")
Output
The image and text are matched with a probability of 0.9884
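- As a quick sanity check (not part of the notebook), a caption that does not describe the image should get a much lower match probability:
text_mismatch = "an image of a city skyline at night"
inputs = processor(images=raw_image,
                   text=text_mismatch,
                   return_tensors="pt")
itm_scores_mismatch = model(**inputs)[0]
print(torch.nn.functional.softmax(itm_scores_mismatch, dim=1))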
Lesson 11: Image Captioning
from transformers import BlipForConditionalGeneration
model = BlipForConditionalGeneration.from_pretrained(
"./models/Salesforce/blip-image-captioning-base")
from transformers import AutoProcessor
processor = AutoProcessor.from_pretrained(
"./models/Salesforce/blip-image-captioning-base")
from PIL import Image
image = Image.open("./beach.jpeg")
The photo is the same one used in Lesson 10.
Conditional Image Captioning
text = "a photograph of"
inputs = processor(image, text, return_tensors="pt")
out = model.generate(**inputs)
Output
tensor([[30522, 1037, 9982, 1997, 1037, 2450, 1998, 2014, 3899, 2006,
1996, 3509, 102]])
print(processor.decode(out[0], skip_special_tokens=True))
Output
a photograph of a woman and her dog on the beach
Unconditional Image Captioning
inputs = processor(image,return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
Output
a woman sitting on the beach with her dog
Lesson 12: Visual Question Answering
from transformers import BlipForQuestionAnswering
model = BlipForQuestionAnswering.from_pretrained(
"./models/Salesforce/blip-vqa-base")
from transformers import AutoProcessor
processor = AutoProcessor.from_pretrained(
"./models/Salesforce/blip-vqa-base")
from PIL import Image
image = Image.open("./beach.jpeg") # same photo as in Lesson 10
question = "how many dogs are in the picture?"
inputs = processor(image, question, return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True)) # Output: 1
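- You can ask the same image other free-form questions; for example (a usage sketch, output not shown):
question = "what is the woman doing?"
inputs = processor(image, question, return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))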
Lesson 13: Zero-Shot Image Classification
CLIP
from transformers import CLIPModel
model = CLIPModel.from_pretrained(
"./models/openai/clip-vit-large-patch14")
from transformers import AutoProcessor
processor = AutoProcessor.from_pretrained(
"./models/openai/clip-vit-large-patch14")
from PIL import Image
image = Image.open("./kittens.jpeg")
Output
- Set the list of labels from which you want the model to classify the image (above).
labels = ["a photo of a cat", "a photo of a dog"]
inputs = processor(text=labels,
images=image,
return_tensors="pt",
padding=True)
outputs = model(**inputs)
outputs.logits_per_image # tensor([[18.9041, 11.7159]])
probs = outputs.logits_per_image.softmax(dim=1)[0] # tensor([9.9925e-01, 7.5487e-04])
probs = list(probs)
for i in range(len(labels)):
print(f"label: {labels[i]} - probability of {probs[i].item():.4f}")
Output
label: a photo of a cat - probability of 0.9992
label: a photo of a dog - probability of 0.0008
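- Because classification is zero-shot, the candidate labels are free-form text and can be made as specific as you like (a usage sketch, output not shown):
labels = ["a photo of two kittens",
          "a photo of one kitten",
          "a photo of a puppy"]
inputs = processor(text=labels,
                   images=image,
                   return_tensors="pt",
                   padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=1)[0]
for label, p in zip(labels, probs):
    print(f"{label}: {p.item():.4f}")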
Postscript
I completed this course on March 9, 2024. It is a short course offered by the Hugging Face team on DeepLearning.AI that mainly covers how to use models from the Hugging Face Hub. The content is fairly introductory and works well as a starting point.