This lesson covers the following topics:
- How to call multimodal models through NIM
- Running Phi-3-Vision inference via the NIM API
- Building an interactive front end with the Gradio framework
Apply for a NIM API key so that you can call NIM's compute resources.
Go to NVIDIA NIM | phi-3-vision-128k-instruct and click the Get API Key button to generate a key.
Step 1: Import the required packages
This lab relies on three main packages (an install sketch follows this list):
- langchain_nvidia_ai_endpoints: used to call NVIDIA NIM compute resources
- langchain: used to build the conversation chain and wire the agent's components together
- base64: since this lab builds a multimodal agent, base64 is needed to encode and decode images
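If any of these are missing from your environment, they can be installed with pip; the package names below are assumed to match the imports used in this notebook:
!pip install langchain langchain-core langchain-nvidia-ai-endpoints matplotlib numpy pillow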
from langchain_nvidia_ai_endpoints import ChatNVIDIA
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain.schema.runnable import RunnableLambda
from langchain.schema.runnable.passthrough import RunnableAssign
from langchain_core.runnables import RunnableBranch
from langchain_core.runnables import RunnablePassthrough
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferMemory
import os
import base64
import matplotlib.pyplot as plt
import numpy as np
Paste the key you generated above here; it is needed whenever we send a compute request to the server.
os.environ["NVIDIA_API_KEY"] = "nvapi-Exxxxxx"
List the models that are currently available:
ChatNVIDIA.get_available_models()
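The call above returns a fairly long list; a quick way to narrow it down to the vision models used in this lab (assuming each entry exposes its name through an id attribute) is:
# Keep only the models whose id mentions "vision"
vision_models = [m for m in ChatNVIDIA.get_available_models() if "vision" in m.id]
print(vision_models)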
Step 2: Use Microsoft Phi 3 Vision to parse the data in an image
Encode the image as a base64 string:
def image2b64(image_file):
    with open(image_file, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
        return image_b64
image_b64 = image2b64("eco-good-bad-chart.png")
View the image we just encoded:
from PIL import Image
display(Image.open("eco-good-bad-chart.png"))
Pass the encoded image, in the format shown below, to Microsoft Phi 3 Vision and let it extract the underlying data from the chart.
chart_reading = ChatNVIDIA(model="ai-phi-3-vision-128k-instruct")
result = chart_reading.invoke(f'Generate underlying data table of the figure below, : <img src="data:image/png;base64,{image_b64}" />')
print(result.content)
Of course, you can also call any other large model to do the work for you.
The example below calls the Llama3 70B model with a Chinese prompt (asking how to implement quicksort in Python).
instruct_chat = ChatNVIDIA(model="ai-llama3-70b")
# result = instruct_chat.invoke('How to implement Fibonacci in python using dynamic programming')
result = instruct_chat.invoke('怎么用 Python 实现快速排序')
print(result.content)
Step 3: Build the multimodal agent with LangChain
Agent use case: convert a statistical chart embedded in an image into data that can be analyzed with Python.
Agent workflow:
- Receive the image and read the data it contains
- Adjust and analyze the data
- Generate code that plots the chart, and execute that code
- Plot the chart from the processed data
Receive image -> analyze data -> modify data -> generate plotting code -> execute code -> show result
Helper functions
The functions here print intermediate values, execute generated code, and so on; they will be used as the agent runs.
import re

# Save the table produced during the LangChain run into a global variable
def save_table_to_global(x):
    global table
    if 'TABLE' in x.content:
        table = x.content.split('TABLE', 1)[1].split('END_TABLE')[0]
    return x

# Helper function for debugging: print an intermediate value and pass it through unchanged
def print_and_return(x):
    print(x)
    return x

# Post-process the code generated by the LLM: drop comments and explanatory text, keep only the Python code
def extract_python_code(text):
    pattern = r'```python\s*(.*?)\s*```'
    matches = re.findall(pattern, text, re.DOTALL)
    return [match.strip() for match in matches]

# Execute the code generated by the LLM
def execute_and_return(x):
    code = extract_python_code(x.content)[0]
    try:
        exec(str(code))  # exec returns None; we only care about its side effects (e.g. showing a plot)
    except Exception:
        print("The code is not executable, don't give up, try again!")
    return x

# Encode an image as base64 so it can be passed to the LLM
def image2b64(image_file):
    with open(image_file, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
        return image_b64
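As a quick sanity check, extract_python_code can be tried on a hand-written string (the sample text below is made up purely for illustration):
sample = "Here is the code:\n```python\nprint('hello')\n```\nDone."
print(extract_python_code(sample))  # expected: ["print('hello')"]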
Define the multimodal data-analysis agent
- First we define the prompt template chart_reading_prompt; the input image is converted into a base64 string and passed into it.
- The rendered prompt is fed to chart_reading, i.e. the microsoft/phi-3-vision model, which analyzes the chart and returns the table we need (the table variable).
- The table produced by Phi-3-Vision, together with the user's instruction, is then passed to a second LLM (Llama 3.1), which modifies the data and generates code.
- The generated Python code is run by the execution helper defined above to produce the final result.
def chart_agent(image_b64, user_input, table):
    # Chart reading Runnable
    chart_reading = ChatNVIDIA(model="ai-phi-3-vision-128k-instruct")
    chart_reading_prompt = ChatPromptTemplate.from_template(
        'Generate underlying data table of the figure below, : <img src="data:image/png;base64,{image_b64}" />'
    )
    chart_chain = chart_reading_prompt | chart_reading

    # Instruct LLM Runnable
    # instruct_chat = ChatNVIDIA(model="nv-mistralai/mistral-nemo-12b-instruct")
    # instruct_chat = ChatNVIDIA(model="meta/llama-3.1-8b-instruct")
    # instruct_chat = ChatNVIDIA(model="ai-llama3-70b")
    instruct_chat = ChatNVIDIA(model="meta/llama-3.1-405b-instruct")
    instruct_prompt = ChatPromptTemplate.from_template(
        "Do NOT repeat my requirements already stated. Based on this table {table}, {input}" \
        "If has table string, start with 'TABLE', end with 'END_TABLE'." \
        "If has code, start with '```python' and end with '```'." \
        "Do NOT include table inside code, and vice versa."
    )
    instruct_chain = instruct_prompt | instruct_chat

    # Read the chart only if 'table' has not been filled in yet
    chart_reading_branch = RunnableBranch(
        (lambda x: x.get('table') is None, RunnableAssign({'table': chart_chain})),
        (lambda x: x.get('table') is not None, lambda x: x),
        lambda x: x
    )
    # Update the global table when the response contains one
    update_table = RunnableBranch(
        (lambda x: 'TABLE' in x.content, save_table_to_global),
        lambda x: x
    )
    # Execute the generated plotting code when the response contains code
    execute_code = RunnableBranch(
        (lambda x: '```python' in x.content, execute_and_return),
        lambda x: x
    )

    chain = (
        chart_reading_branch
        #| RunnableLambda(print_and_return)
        | instruct_chain
        #| RunnableLambda(print_and_return)
        | update_table
        | execute_code
    )

    return chain.invoke({"image_b64": image_b64, "input": user_input, "table": table}).content
Initialization
# Use the global variable table to store the extracted data
table = None
# Convert the image to be processed into base64
image_b64 = image2b64("image.png")
# Display the image we just read in
from PIL import Image
display(Image.open("image.png"))
Run the agent we just built
First, have the agent turn the chart data in the image into a table string.
user_input = "show this table in string"
chart_agent(image_b64, user_input, table)
print(table) # let's see what 'table' looks like now
Let the agent modify the table contents on its own
For example, here we replace every 'UK' with 'United Kingdom'.
user_input = "replace table string's 'UK' with 'United Kingdom'"
chart_agent(image_b64, user_input, table)
print(table) # let's see what 'table' looks like now
Plot the chart with Python
Here the LLM generates the plotting code, and the generated code is then executed.
!pip install gradio
!pip install pandas
user_input = "draw this table as stacked bar chart in python"
result = chart_agent(image_b64, user_input, table)
print("result: "+result) # let's see what 'table' looks like now
Step 4: Wrap the multimodal agent in Gradio
With the steps above completed, we now have an agent that can analyze images, generate code, modify data, and execute code.
Next we add a UI to the agent so that talking to it is more comfortable.
We first tweak the code-execution function: Python's built-in exec returns None, so we make the function return the path of the generated image instead.
In the code below, the path points to the author's PC; please replace the folder path with one that exists on your own system.
global img_path
# img_path = '/home/nvidia/2024_summer_bootcamp/day3/'+'image.png'
img_path = 'C:\\Users\\wei\\Documents\\nvidiafiles\\day3\\'+'image.png'
print(img_path)

def execute_and_return_gr(x):
    code = extract_python_code(x.content)[0]
    try:
        exec(str(code))  # exec returns None; the generated code saves the chart to img_path
    except Exception:
        print("The code is not executable, don't give up, try again!")
    return img_path
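If you prefer not to hard-code an absolute Windows path, a minimal portable alternative (assuming the generated image can simply live next to the notebook) is:
import os
# Save the generated chart next to the notebook, regardless of operating system
img_path = os.path.join(os.getcwd(), "image.png")
print(img_path)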
The original chart_agent function took a base64 string as input, but Gradio passes the uploaded image as an image file (png, jpg, and so on).
So we update the function and add an encoding step at the very beginning.
# table defaults to None so that Gradio only needs to supply the image and the text prompt
def chart_agent_gr(image_b64, user_input, table=None):
    # Gradio passes the uploaded image as a file path, so encode it to base64 first
    image_b64 = image2b64(image_b64)

    # Chart reading Runnable
    chart_reading = ChatNVIDIA(model="microsoft/phi-3-vision-128k-instruct")
    chart_reading_prompt = ChatPromptTemplate.from_template(
        'Generate underlying data table of the figure below, : <img src="data:image/png;base64,{image_b64}" />'
    )
    chart_chain = chart_reading_prompt | chart_reading

    # Instruct LLM Runnable
    # instruct_chat = ChatNVIDIA(model="nv-mistralai/mistral-nemo-12b-instruct")
    # instruct_chat = ChatNVIDIA(model="meta/llama-3.1-8b-instruct")
    # instruct_chat = ChatNVIDIA(model="ai-llama3-70b")
    instruct_chat = ChatNVIDIA(model="meta/llama-3.1-405b-instruct")
    instruct_prompt = ChatPromptTemplate.from_template(
        "Do NOT repeat my requirements already stated. Based on this table {table}, {input}" \
        "If has table string, start with 'TABLE', end with 'END_TABLE'." \
        "If has code, start with '```python' and end with '```'." \
        "Do NOT include table inside code, and vice versa."
    )
    instruct_chain = instruct_prompt | instruct_chat

    # Read the chart only if 'table' has not been filled in yet
    chart_reading_branch = RunnableBranch(
        (lambda x: x.get('table') is None, RunnableAssign({'table': chart_chain})),
        (lambda x: x.get('table') is not None, lambda x: x),
        lambda x: x
    )
    # Update the global table when the response contains one
    update_table = RunnableBranch(
        (lambda x: 'TABLE' in x.content, save_table_to_global),
        lambda x: x
    )
    # Execute the generated plotting code and return the image path
    execute_code = RunnableBranch(
        (lambda x: '```python' in x.content, execute_and_return_gr),
        lambda x: x
    )

    chain = (
        chart_reading_branch
        | RunnableLambda(print_and_return)
        | instruct_chain
        | RunnableLambda(print_and_return)
        | update_table
        | execute_code
    )

    return chain.invoke({"image_b64": image_b64, "input": user_input, "table": table})
Here is a sample prompt that you can copy and paste directly once the Gradio page opens.
user_input = "replace table string's 'UK' with 'United Kingdom', draw this table as stacked bar chart in python, and save the image in path: "+img_path
print(user_input)
Running the code below starts a Gradio server; we can then use the Gradio page to chat with the agent we have built.
import gradio as gr
multi_modal_chart_agent = gr.Interface(fn=chart_agent_gr,
inputs=[gr.Image(label="Upload image", type="filepath"), 'text'],
outputs=['image'],
title="Multi Modal chat agent",
description="Multi Modal chat agent",
allow_flagging="never")
multi_modal_chart_agent.launch(debug=True, share=False, show_api=False, server_port=5000, server_name="0.0.0.0")
Once it is running, open http://localhost:5000.