Building Data Multi-Agents with LLMs to Empower a Data Analysis Platform, Practice ②: Data Governance, Part 2 (Automated Processing)

2024/11/16 19:23:26

Preface

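Building on the multi-Agent design for data analysis described in the previous article, this article continues to explore and test how an LLM can make data governance based on a private knowledge base fully automated and intelligent. The overall design is as follows:
[Figure: overall system design]
The system comprises three Agents plus a Plan & Execute Agent. The first Agent searches the enterprise's private data-standards knowledge base and generates knowledge chunks relevant to the user's question. The second Agent performs data quality analysis on the enterprise's or the user's data. The third Agent summarizes the user's question or the other Agents' results into a report. The Plan & Execute Agent plans tasks from the user's question and dispatches the Agents to execute them. **The example workflow in this article is as follows:** retrieve the data governance process or the standard range for the data from the user's private knowledge base; following that standard or process, analyze the outliers and data quality of the data uploaded by the user; then write a data quality report.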

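I. Agent ①: Knowledge Retrieval

Use create_retriever_tool and ZeroShotAgent to build the first Agent, which retrieves relevant knowledge chunks from the private knowledge base and generates answers from them.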

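# Import paths assume a LangChain 0.1.x-style installation
from langchain.agents import AgentExecutor, Tool, ZeroShotAgent
from langchain.memory import ConversationBufferMemory
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.tools.retriever import create_retriever_tool
from langchain_community.document_loaders import Docx2txtLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

# model_id is not used below; the embeddings come from the sentence-transformers model
model_id = "iic/nlp_corom_sentence-embedding_english-base"
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
# Load the standards document and split it into overlapping chunks
loader = Docx2txtLoader('./standard documents (1).docx')
docs = loader.load()
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500, chunk_overlap=150)
split_docs = text_splitter.split_documents(docs)
# Index the chunks in a vector store and expose the retriever as a tool
vectordb = Chroma.from_documents(documents=split_docs, embedding=embeddings)
retriever_tool = create_retriever_tool(
    vectordb.as_retriever(), "Search_for_data_retriever", "Retrieves text blocks related to the question from the data standards."
)

tools_kg = [retriever_tool]
# Note: this memory object is defined but not attached to the agent in this snippet
memory = ConversationBufferMemory(
    memory_key="chat_history",
    human_prefix="Input",
    ai_prefix="Response",
    input_key="question",
    return_messages=False,
)
template_kg = 'You are a knowledge retrieval and generation tool that searches for relevant text based on user questions and generates answers accordingly.'
# `llm` is the model instance created elsewhere
agent_kg = ZeroShotAgent.from_llm_and_tools(
            llm=llm,
            tools=tools_kg,
            prefix=template_kg,
        )
# Wrap the agent in an executor (as with the other Agents) so that it exposes .invoke
agent_kg = AgentExecutor(agent=agent_kg, tools=tools_kg, handle_parsing_errors=True, verbose=True)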
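
To make the `chunk_size=500, chunk_overlap=150` setting concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplification: `RecursiveCharacterTextSplitter` additionally prefers to break on separators such as paragraphs and line breaks, so real chunk boundaries will differ. The helper name `sliding_chunks` and the sample text are hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int = 500, chunk_overlap: int = 150) -> list[str]:
    """Split text into fixed-size windows; consecutive windows share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# Hypothetical 1200-character document
sample = "".join(str(i % 10) for i in range(1200))
chunks = sliding_chunks(sample)
```

The overlap means that a standard whose wording straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.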

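II. Agent ②: Data Analysis

1. Prompt design: the prompt has three parts: a brief introduction to the tool, basic information about the data (dhead), and a reference example.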

TEMPLATE_dt = """You are working with a pandas dataframe in Python. The name of the dataframe is `df`.
It is important to understand the attributes of the dataframe before working with it. This is the result of running `df.head().to_markdown()`

<df>
{dhead}
</df>

You are not meant to use only these rows to answer questions - they are meant as a way of telling you about the shape and schema of the dataframe.
You also do not have to use only the information here to answer questions - you can run intermediate queries to do exploratory data analysis to give you more information as needed.

You possess an essential tool: `python_repl_dt`. With this tool, you can analyze and process the data retrieved from df using Python code.

When facing a question, assess whether you need to employ these tools iteratively.

Example:
<question>It is known that the price interval for apples in the market is between 5 and 20 yuan per kilogram; your task is to detect any anomalous values within the set of apple market price quotes. </question>
<logic>
  First, scrutinize the user's query, in which the user has furnished an essential detail: the benchmark range for apple prices is from 5 to 20 yuan per kilogram.
  Then, leverage `python_repl_dt` to detect any anomalous values based on the retrieved data.
  The code written by python_repl_dt:
  '''import pandas as pd
     df[df['apple'].lt(5) | df['apple'].gt(20)].to_csv('apple market price anomalous.csv')'''
  The output should print the anomalous values and write a file named apple market price anomalous.csv.
</logic>
If you have not found the precise answer, please iteratively use the tools.
"""

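The range check that the prompt example asks the Agent to generate can be illustrated without pandas. The sketch below is a plain-Python equivalent of filtering with `lt(5) | gt(20)`: values exactly on the boundaries count as normal. The function name `out_of_range`, the `market` column, and the sample quotes are hypothetical; only the `apple` column and the 5 to 20 range come from the prompt.

```python
import csv
import io


def out_of_range(rows, column, low, high):
    """Return the rows whose value in `column` lies outside the closed interval [low, high]."""
    return [row for row in rows if not (low <= float(row[column]) <= high)]


# Hypothetical quotes in yuan per kilogram
quotes = [
    {"market": "A", "apple": 4.5},
    {"market": "B", "apple": 5.0},
    {"market": "C", "apple": 12.0},
    {"market": "D", "apple": 20.0},
    {"market": "E", "apple": 23.8},
]
anomalies = out_of_range(quotes, "apple", low=5, high=20)

# Serialize the anomalies as CSV text (kept in memory here instead of writing a file)
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["market", "apple"])
writer.writeheader()
writer.writerows(anomalies)
csv_text = buffer.getvalue()
```
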
2. Building the tool and assembling the Agent

from langchain_experimental.tools import PythonAstREPLTool

# Use the PythonAstREPLTool; `df` is the user's data, already loaded as a pandas DataFrame
repl = PythonAstREPLTool(
            locals={"df": df},
            name="python_repl_dt",
            description="This tool generates Python analysis code for the dataframe 'df' (pig_market_data), runs the code, and outputs both the code and the results of the computation.",
            #args_schema=PythonInputs,
)
tools_re = [repl]
# Agent_dt: fill the prompt with a preview of the data, then build the agent
template_dt = TEMPLATE_dt.format(dhead=df.head().to_markdown())
agent_dt = ZeroShotAgent.from_llm_and_tools(
            llm=llm,
            tools=tools_re,
            prefix=template_dt,
        )
# Wrap the agent in an executor so that it can run the tool loop
agent_dt = AgentExecutor(agent=agent_dt, tools=tools_re, max_iterations=50, handle_parsing_errors=True, early_stopping_method="generate", verbose=True)
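
The idea behind a tool like `python_repl_dt` can be sketched with the standard library alone: run the generated code against a namespace that already contains the data, and return the value of the final expression. This is a simplified stand-in, not the `PythonAstREPLTool` implementation; `run_snippet` and the toy `df` list are hypothetical. Like any such tool, it executes arbitrary code, so it should only run in a sandboxed environment.

```python
import ast


def run_snippet(code: str, namespace: dict):
    """Execute `code` in `namespace`; return the value of a trailing expression, if any."""
    tree = ast.parse(code)
    if not tree.body:
        return None
    *statements, last = tree.body
    # Run every statement except the last one for its side effects
    exec(compile(ast.Module(body=statements, type_ignores=[]), "<snippet>", "exec"), namespace)
    if isinstance(last, ast.Expr):
        # The last statement is an expression: evaluate it and return its value
        return eval(compile(ast.Expression(body=last.value), "<snippet>", "eval"), namespace)
    # Otherwise run it as an ordinary statement
    exec(compile(ast.Module(body=[last], type_ignores=[]), "<snippet>", "exec"), namespace)
    return None


namespace = {"df": [98.0, 120.5, 250.0]}  # toy stand-in for the DataFrame
result = run_snippet("limit = 200\n[w for w in df if w > limit]", namespace)
```

Because the namespace persists between calls, a later snippet can reuse variables such as `limit`, which is what lets an Agent explore the data over several iterations.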

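III. Agent ③: Summary Report

1. Prompt design: define a basic framework for the summary report, covering purpose, background, process, conclusions, and recommendations.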

template_bg = """
    As a Data Analyst, please compose a comprehensive data report addressing the user's inquiry: {question}, adhering to the following guidelines:

  1.Introduction: Begin with a clear and concise overview of the purpose of the report, briefly outlining the user's question or problem statement. This section should set the context and provide a roadmap for the content that follows.

  2.Data Scope and Source: Specify the scope of the data analyzed, including relevant timeframes, geographical coverage, and any specific datasets utilized. Mention the sources from which the data was obtained, emphasizing their credibility and relevance to the analysis.

  3.Methodology: Describe the analytical methods employed to examine the data, including any statistical techniques, models, or tools used. Explain why these methods were chosen and how they contribute to answering the user's question. Outline any assumptions, limitations, or caveats associated with the methodology.

  4.Key Findings: Present the main insights derived from the analysis in a structured and visually appealing manner, using tables, charts, graphs, or other appropriate visualizations. Accompany each finding with a clear explanation and, where applicable, quantitative measures such as percentages, averages, or trends. Ensure findings are directly responsive to the user's inquiry and are contextualized within the broader data landscape.

  5.Interpretation and Implications: Interpret the key findings, drawing meaningful conclusions and highlighting their significance. Relate these conclusions back to the user's question, explaining how they address the initial concerns or objectives. Discuss any potential implications for business decisions, strategy, or further research, and offer actionable recommendations where appropriate.

  6.Quality Assurance and Limitations: Discuss the steps taken to ensure data quality throughout the analysis process, such as data cleaning, validation, and outlier detection. Acknowledge any limitations or challenges encountered during the analysis, including data gaps, inconsistencies, or inherent biases, and discuss how these may have influenced the results and conclusions.

  7.Conclusion and Next Steps: Summarize the key takeaways from the report, reinforcing the most important findings and their implications. Suggest potential avenues for future analysis or data collection that could further enhance understanding or address remaining questions. Encourage user engagement by inviting feedback or follow-up inquiries.

  8.Appendix and Supporting Materials: Include any additional information, detailed calculations, or raw data that support the analysis but would disrupt the flow of the main report. This might include detailed statistical outputs, full dataset summaries, or detailed descriptions of complex methodologies.

  By adhering to these guidelines, your data report will effectively communicate the results of your analysis, address the user's question thoroughly, and provide a robust foundation for informed decision-making.
"""

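2. Building the Agent

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

# Report chain: pass the question through, fill the prompt, call the model, return plain text
prompt = ChatPromptTemplate.from_template(template_bg)
chain_bg = (
    {"question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)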
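
The `|` composition used for `chain_bg` can be pictured as ordinary function chaining. The sketch below imitates it with a tiny hypothetical `Step` class; it does not use LangChain, and `fake_llm` is a stand-in that calls no model.

```python
class Step:
    """A callable stage that can be chained with `|`, loosely imitating the pipe syntax."""

    def __init__(self, fn):
        self.fn = fn

    def __or__(self, other: "Step") -> "Step":
        # (a | b).invoke(x) is equivalent to b.invoke(a.invoke(x))
        return Step(lambda value: other.fn(self.fn(value)))

    def invoke(self, value):
        return self.fn(value)


wrap = Step(lambda q: {"question": q})  # plays the role of {"question": RunnablePassthrough()}
fill = Step(lambda d: "Write a report on: {question}".format(**d))  # plays the role of the prompt
fake_llm = Step(lambda text: f"[model output for] {text}")  # stand-in, no model is called
parse = Step(str.strip)  # plays the role of StrOutputParser

chain = wrap | fill | fake_llm | parse
report = chain.invoke("pig weight outliers")
```

Each stage only needs to accept the previous stage's output, which is why the report chain can later be registered as a tool through its `invoke` method.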

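IV. Designing and Building the Plan & Execute Agent

1. Approach 1: use ZeroShotAgent plus a prompt to build an Agent that plans tasks and dispatches their execution
(1) Prompt design: the prompt has three parts: basic information about the Agent (its role and tool library), the task execution flow, and a reference example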

template= """
            As an AI data analyst, it is crucial to optimize workflow by utilizing tools within the tool library, especially "Search_for_data_standard", "Analysis_for_pig_market_data" and "Tool_for_Data_analysis_report". Below are streamlined methods for efficiently handling user inquiries:

1. **Task Understanding & Strategy Formulation**:
   - Firstly, comprehensively understand the user's requirements and constraints, aligning with industry expertise and best practices.
   - Develop personalized data analysis strategies, setting operational guidelines for each data item based on established norms.
   - Based on the user's questions, output a matching task list.

2. **Utilizing "Search_for_data_standard" to Establish Data Analysis Framework and Standards**:
   - Within the strategy, define standards and guidelines for each data item to ensure alignment with business context and analysis objectives.

3. **Performing Data Analysis Using "Analysis_for_pig_market_data" Tool**:
   - Employ the "Analysis_for_pig_market_data" tool to analyze data uploaded by the user, following the framework and standards established by "Search_for_data_standard".
   - That is, the input of tool "Analysis_for_pig_market_data" should contain the framework and standards established by "Search_for_data_standard".

4. **Utilizing "Tool_for_Data_analysis_report" to Craft a Report**:
   - Compose a data analysis report based on industry knowledge and standards (queried with "Search_for_data_standard"), data analysis results ("Analysis_for_pig_market_data" output), and user inquiries.

Example:
<question>Please calculate the cost and profit of the company's batch products for this year, and identify the batches of products that exceed the average cost. And generate a financial report</question>
<logic>
  First, Based on the user's questions, plan a task list and execution process:
  1. Use tool "Search_for_data_standard" to search and generate the company's financial management system and cost accounting process;
  2. According to the system or process, use tool "Analysis_for_pig_market_data" to analyze financial data, including batch cost, profit, and batch product ranking;
  3. Write financial analysis reports using "Tool_for_Data_analysis_report" tools based on management systems, cost accounting processes, data analysis, etc.
<task1>Use tool "Search_for_data_standard" to search and generate......
task1 output: The cost of product batches should be between 10-20 yuan/kg
<task2>use tool "Analysis_for_pig_market_data" to analyze financial data......
task2 input: the batch cost of the product should be between 10-20 yuan/kg. Analyze products with unreasonable batch costs
task2 output:
 '''import pandas as pd
     df[df['apple'].lt(10) | df['apple'].gt(20)].to_csv("the batch cost of the product  anomalous.csv", index=False)'''
<task3>Write a report using tool "Tool_for_Data_analysis_report"......
task3 input: the company's financial management system and cost accounting process,Financial data analysis results.....
task3 output: Our company's batch cost analysis overview is as follows: the average cost of a batch is 15 yuan/batch, and the batches of products that exceed the abnormal cost warning value include A, B, C, etc......
</logic>

        """

(2) Building the tool library and the P-Agent

tools = [
            Tool(
                name = "Search_for_data_standard",
                func=agent_kg.invoke,
                description="By retrieving relevant text, obtain industry standards, data analysis strategies related to user questions, and summarize and generate answers."
                ),
            Tool(
                name = "Analysis_for_pig_market_data",
                func=agent_dt.invoke,
                description="A tool used to execute code that analyzes the data named pig_market_data, where the data is the slaughter weight and price data of the pig market."
                ),
            Tool(
                name = "Tool_for_Data_analysis_report",
                func=chain_bg.invoke,
                description="As a Data Analyst, please compose a comprehensive data report addressing the user's inquiry."
                ),
        ]

agent = ZeroShotAgent.from_llm_and_tools(
            llm=llm,
            tools=tools,
            prefix=template,
        )
agent_executor = AgentExecutor(agent=agent, tools=tools, max_iterations=150, handle_parsing_errors=True,early_stopping_method="generate",verbose=True)

2. Approach 2: use LangChain's plan-and-execute module

from langchain_experimental.plan_and_execute import PlanAndExecute, load_agent_executor, load_chat_planner

planner = load_chat_planner(llm,template)
executor = load_agent_executor(llm, tools, verbose=True)

agent = PlanAndExecute(planner=planner, executor=executor, verbose=True)
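
Both approaches rely on the same data flow: the output of each step becomes part of the input of the next. The sketch below shows that flow with plain functions standing in for the three tools. The stub bodies, the `run_plan` helper, and the sample strings are hypothetical placeholders, not LangChain APIs.

```python
from typing import Callable


def search_standard(task: str) -> str:
    # Stand-in for the knowledge-retrieval Agent
    return "standard: weight between 70 and 200 kg"


def analyze_data(task: str) -> str:
    # Stand-in for the data-analysis Agent; its input carries the retrieved standard
    return f"analysis done using [{task}]"


def write_report(task: str) -> str:
    # Stand-in for the report chain
    return f"report based on [{task}]"


def run_plan(question: str, steps: list[Callable[[str], str]]) -> list[str]:
    """Run the steps in order, feeding each step the question plus all earlier outputs."""
    outputs: list[str] = []
    for step in steps:
        context = " | ".join([question, *outputs])
        outputs.append(step(context))
    return outputs


outputs = run_plan("find abnormal weights", [search_standard, analyze_data, write_report])
```

In the real system the planner chooses the steps and the LLM decides what to pass along, so the standard reaches the analysis step only if the prompt tells the model to include it; that is the purpose of item 3 in the prompt above.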

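V. Results in Practice

Task 1:
Retrieve the data standard (the standard weight range for live pigs) from the private standards knowledge base, identify the outliers in the user's data against that standard, generate a table of abnormal data, and write a data quality analysis report based on the data standard and the abnormal data.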

agent_executor.invoke({'input':'''Obtain the weight standards for pigs in the market; 
According to this standard, Identify abnormal weight values in data named pig_market_data from various provinces and cities across the country that exceed the slaughter standard weight range,
And generate a csv containing the identified outliers,above all, 
Write a data quality analysis report.'''})

① Task planning:

steps=[Step(value='Use tool "Search_for_data_standard" to search and generate weight standards for pigs in the market; ')
Step(value='According to the weight standards, use tool "Analysis_for_pig_market_data" to analyze the data named pig_market_data from various provinces and cities across the country, identifying abnormal weight values that exceed the slaughter standard weight range; ')
Step(value='Generate a CSV file containing the identified outliers;')
Step(value='Write a data quality analysis report based on the weight standards, data analysis results, and identified outliers.\n</logic>\n\n<task1>Use tool "Search_for_data_standard" to search and generate weight standards for pigs in the market.\ntask1 output: The weight standards for pigs in the market should be between 100-150 kg.\n\n<task2>Use tool "Analysis_for_pig_market_data" to analyze the data named pig_market_data from various provinces and cities across the country.\ntask2 input: The weight standards for pigs in the market should be between 100-150 kg. Analyze data to identify abnormal weight values that exceed the slaughter standard weight range.\ntask2 output: \n```python\nimport pandas as pd\ndf[df[\'weight\'].lt(100) | df[\'weight\'].gt(150)].to_csv("abnormal_weight_values.csv", index=False)\n```\n\n<task3>Write a report using tool "Tool_for_Data_analysis_report" based on weight standards, data analysis results, and identified outliers.\ntask3 input: Weight standards for pigs in the market, data analysis results, identified outliers in the data.\ntask3 output: The data quality analysis report includes information on the weight standards, analysis of abnormal weight values exceeding the slaughter standard weight range, and recommendations for data quality improvement.\n')]

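② Run the first Agent (knowledge retrieval and generation): the standard for pig slaughter weight retrieved from the private knowledge base is 70~200 kg.

③ Run the second Agent (data quality analysis): using the slaughter weight range [70, 200], it queries the outliers and generates an abnormal-data file.

④ Run the third Agent (data quality report writing): it writes the report following the data analysis report template provided.

Task 2:
Retrieve and generate the data quality analysis process from the private knowledge base, identify the outliers in the user's data by following that process, generate a table of abnormal data, and write a data quality analysis report based on the data standard and the abnormal data.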

agent_executor.invoke({'input':'Obtain the data quality analysis process for pigs in the market; According to this process, Identify abnormal weight values in data named pig_market_data from various provinces and cities across the country that exceed the slaughter standard weight range,And generate a csv containing the identified outliers,above all, Write a data quality analysis report.'})

① Task planning

I need to first search for the data quality analysis process for pigs in the market to understand the standards and guidelines. Then, I should analyze the pig_market_data to identify abnormal weight values that exceed the slaughter standard weight range. Finally, I need to generate a csv file with the identified outliers and write a data quality analysis report.


② Run the first Agent: retrieve and generate the data analysis process

Observation: {'input': 'Data quality analysis process for pigs in the market', 'output': 'The data quality analysis process for pigs in the market involves querying abnormal data, using methods like Z-score and GNN to discover anomalies, and generating a data report to evaluate data quality in terms of integrity, consistency, timeliness, and accuracy.'}

③ Run the second Agent: analyze data quality according to the data quality analysis process, using the Z-score algorithm
[Figure: the generated abnormal-data file]
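
For reference, Z-score outlier detection can be sketched with the standard library as follows. This is an illustration rather than the code the Agent generated: the threshold of 3 is a common convention assumed here, and the function name and sample weights are hypothetical.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=3.0):
    """Return (index, value, z) for every value whose |z-score| exceeds `threshold`."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values are identical, so nothing can be an outlier
    return [
        (i, v, (v - mu) / sigma)
        for i, v in enumerate(values)
        if abs(v - mu) / sigma > threshold
    ]


# Hypothetical weights in kg; the last value looks like a data-entry error
weights = [118, 121, 119, 122, 120, 117, 123, 120, 119, 121, 120, 118, 122, 1200]
outliers = zscore_outliers(weights)
```

With the population standard deviation, no value in a sample of n points can have a |z-score| above sqrt(n - 1), so a threshold of 3 can only flag anything when there are at least 11 rows. Unlike the fixed [70, 200] range of Task 1, the Z-score is relative to the data itself, so a dataset in which most values are wrong would hide its own errors.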

④ Run the third Agent: data quality analysis report

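VI. Discussion

This test provides initial validation that Data Multi-Agents built on an LLM can be embedded into a data platform for data governance: multiple Agents worked together, and data quality was analyzed automatically against a private knowledge base. The following issues came up during testing and still need consideration:
1. Choice of LLM: in this practice, the LLM at the core of each Agent was tested, and in principle each Agent can use a different LLM as its core. Based on this test, we suggest choosing an LLM suited to each Agent's purpose: for example, a model with strong code generation for data analysis (Python code) and a model with strong task planning for the Plan & Execute Agent. For a specific domain, it may even be necessary to fine-tune a matching model.
2. The knowledge retrieval Agent may need more recent RAG techniques in order to retrieve and generate more accurate knowledge chunks from a large volume of data standards.
3. For the data quality analysis Agent to handle more complex scenarios, such as multiple tables, multiple databases, and relationships between tables, a base repository of data resource information is needed, covering metadata, data lineage, and table relationships, so that the LLM can process the data more completely and accurately.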
