Processing Unstructured Data Before Building LLM Applications (Part 2): Metadata Extraction and Document Chunking

2024/10/3 2:17:23

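1. What this covers

The material in this section comes from Andrew Ng's course Preprocessing Unstructured Data for LLM Applications. Because it deals with handling unstructured data, I have written up my study notes here.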

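What is metadata? Metadata can be document-level or element-level. It can be something extracted from the document itself, such as the last-modified date or the file name, or something inferred while preprocessing the document. Metadata is very useful when building hybrid search for RAG.

[Figure: example of a parsed element with its metadata (image not available)]

In that example, the page number and language fields are metadata.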

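What is metadata used for? If you want to restrict a search to a specific section, you can filter on the corresponding metadata field. If you want to limit results to more recent information, you can construct the query so that it only returns documents dated after a given day.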
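The two filters described above can be sketched in plain Python. This is only an illustration: the records and the `section` and `last_modified` field names are made up for the example and are not taken from the course data.

```python
from datetime import date

# Hypothetical element records; the metadata field names are illustrative only.
elements = [
    {"text": "Old intro", "metadata": {"section": "intro", "last_modified": "2021-05-01"}},
    {"text": "New intro", "metadata": {"section": "intro", "last_modified": "2024-02-10"}},
    {"text": "New appendix", "metadata": {"section": "appendix", "last_modified": "2024-03-15"}},
]


def filter_elements(elements, section=None, modified_after=None):
    """Keep elements that match the section and are newer than modified_after."""
    kept = []
    for el in elements:
        meta = el["metadata"]
        # Section filter: skip elements from other sections.
        if section is not None and meta.get("section") != section:
            continue
        # Date filter: skip elements without a date or not newer than the cutoff.
        if modified_after is not None:
            stamp = meta.get("last_modified")
            if stamp is None or date.fromisoformat(stamp) <= modified_after:
                continue
        kept.append(el)
    return kept


recent_intro = filter_elements(elements, section="intro", modified_after=date(2023, 1, 1))
print([el["text"] for el in recent_intro])
```

A vector store applies the same idea on the server side; section 3.7 below does this with a `where` clause.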

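2. Environment setup

See: Processing Unstructured Data Before Building LLM Applications (Part 1)

[Image: directory structure (not available)]
This time we parse EPUB data, an e-book format. As before, you need an API key from unstructured.io.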

3. Trying it out

3.1 Imports and initialization

# Warning control
import warnings
warnings.filterwarnings('ignore')

import logging
logger = logging.getLogger()
logger.setLevel(logging.CRITICAL)

import json
from IPython.display import JSON

from unstructured_client import UnstructuredClient
from unstructured_client.models import shared
from unstructured_client.models.errors import SDKError

from unstructured.chunking.basic import chunk_elements
from unstructured.chunking.title import chunk_by_title
from unstructured.staging.base import dict_to_elements

import chromadb
# Initialize the API client
s = UnstructuredClient(
    api_key_auth="XXX",
    server_url="https://api.unstrXXX",
)
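Instead of pasting the key into the notebook, the credentials can be read from environment variables. The sketch below is one way to do that. The variable names `UNSTRUCTURED_API_KEY` and `UNSTRUCTURED_API_URL` and the helper `load_api_settings` are assumptions made for this example, not names required by the SDK.

```python
import os


def load_api_settings(env=os.environ):
    """Read the API key and server URL from environment variables.

    The variable names are an assumption for this sketch.
    """
    api_key = env.get("UNSTRUCTURED_API_KEY")
    server_url = env.get("UNSTRUCTURED_API_URL")
    if not api_key or not server_url:
        raise RuntimeError("Set UNSTRUCTURED_API_KEY and UNSTRUCTURED_API_URL first")
    # Keys match the keyword arguments used when creating the client above.
    return {"api_key_auth": api_key, "server_url": server_url}


# Usage with the client from the code above:
# s = UnstructuredClient(**load_api_settings())
```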

3.2 Look at the e-book's content and format

from IPython.display import Image
Image(filename='images/winter-sports-cover.png', height=400, width=400)

[Image: cover of the e-book (not available)]

Image(filename="images/winter-sports-toc.png", height=400, width=400) 

[Image: table of contents of the e-book (not available)]

3.3 Parse the book

filename = "example_files/winter-sports.epub"

with open(filename, "rb") as f:
    files=shared.Files(
        content=f.read(),
        file_name=filename,
    )

req = shared.PartitionParameters(files=files)
try:
    resp = s.general.partition(req)
except SDKError as e:
    print(e)

JSON(json.dumps(resp.elements[0:3], indent=2))

Output:

[
  {
    "type": "Title",
    "element_id": "6c6310b703135bfe4f64a9174a7af8eb",
    "text": "The Project Gutenberg eBook of Winter Sports in\nSwitzerland, by E. F. Benson",
    "metadata": {
      "category_depth": 1,
      "emphasized_text_contents": [
        "Winter Sports in\nSwitzerland"
      ],
      "emphasized_text_tags": [
        "span"
      ],
      "languages": [
        "eng"
      ],
      "filename": "winter-sports.epub",
      "filetype": "application/epub"
    }
  },
  {
    "type": "NarrativeText",
    "element_id": "9ecb42d4f263247a920448ed98830388",
    "text": "\nThis ebook is for the use of anyone anywhere in the United States and\nmost other parts of the world at no cost and with almost no restrictions\nwhatsoever. You may copy it, give it away or re-use it under the terms\nof the Project Gutenberg License included with this ebook or online at\n",
    "metadata": {
      "link_texts": [
        "www.gutenberg.org"
      ],
      "link_urls": [
        "https://www.gutenberg.org"
      ],
      "link_start_indexes": [
        285
      ],
      "languages": [
        "eng"
      ],
      "parent_id": "6c6310b703135bfe4f64a9174a7af8eb",
      "filename": "winter-sports.epub",
      "filetype": "application/epub"
    }
  },
  {
    "type": "NarrativeText",
    "element_id": "87ad8d091d5904b17bc345b10a1c964a",
    "text": "www.gutenberg.org. If you are not located\nin the United States, you’ll have to check the laws of the country where\nyou are located before using this eBook.",
    "metadata": {
      "link_texts": [
        "www.gutenberg.org"
      ],
      "link_urls": [
        "https://www.gutenberg.org"
      ],
      "link_start_indexes": [
        -1
      ],
      "languages": [
        "eng"
      ],
      "parent_id": "6c6310b703135bfe4f64a9174a7af8eb",
      "filename": "winter-sports.epub",
      "filetype": "application/epub"
    }
  }
]
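Before filtering, it can help to see which element types and metadata keys the parser produced. The following is a minimal sketch over a hand-trimmed sample shaped like the output above. With real data you would pass `resp.elements` instead of `sample_elements`, and the `summarize` helper is a name invented for this example.

```python
from collections import Counter

# A trimmed sample shaped like the parsed elements shown above.
sample_elements = [
    {"type": "Title", "text": "ICE-HOCKEY", "metadata": {"languages": ["eng"]}},
    {"type": "NarrativeText", "text": "First paragraph.",
     "metadata": {"languages": ["eng"], "parent_id": "abc"}},
    {"type": "NarrativeText", "text": "Second paragraph.",
     "metadata": {"languages": ["eng"], "parent_id": "abc"}},
]


def summarize(elements):
    """Count element types and how often each metadata key appears."""
    type_counts = Counter(el["type"] for el in elements)
    metadata_keys = Counter(key for el in elements for key in el["metadata"])
    return type_counts, metadata_keys


type_counts, metadata_keys = summarize(sample_elements)
print(type_counts.most_common())
print(metadata_keys.most_common())
```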

3.4 Filter for elements of type Title whose text contains "hockey"

[x for x in resp.elements if x['type'] == 'Title' and 'hockey' in x['text'].lower()]

Output:

[{'type': 'Title',
  'element_id': '6cf4a015e8c188360ea9f02a9802269b',
  'text': 'ICE-HOCKEY',
  'metadata': {'category_depth': 0,
   'emphasized_text_contents': ['ICE-HOCKEY'],
   'emphasized_text_tags': ['span'],
   'languages': ['eng'],
   'filename': 'winter-sports.epub',
   'filetype': 'application/epub'}},
 {'type': 'Title',
  'element_id': '4ef38ec61b1326072f24495180c565a8',
  'text': 'ICE HOCKEY',
  'metadata': {'category_depth': 0,
   'languages': ['eng'],
   'filename': 'winter-sports.epub',
   'filetype': 'application/epub'}}]
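The result contains both "ICE-HOCKEY" and "ICE HOCKEY", so an exact string comparison would treat them as different titles. One possible way to match such variants is to normalize titles before comparing. The helper `normalize_title` below is an assumption of this sketch, not part of the unstructured library.

```python
import re


def normalize_title(text):
    """Lowercase and collapse runs of hyphens or whitespace into one space."""
    return re.sub(r"[-\s]+", " ", text).strip().lower()


titles = ["ICE-HOCKEY", "ICE HOCKEY", "SKI-ING", "Ice\nHockey"]
target = normalize_title("ice hockey")
matches = [t for t in titles if normalize_title(t) == target]
print(matches)
```

Whether the two variants should be treated as the same chapter depends on the book. Here one of them is likely the table-of-contents entry and the other the chapter heading, so check before merging them.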

3.5 Split by specified chapters

chapters = [
    "THE SUN-SEEKER",
    "RINKS AND SKATERS",
    "TEES AND CRAMPITS",
    "ICE-HOCKEY",
    "SKI-ING",
    "NOTES ON WINTER RESORTS",
    "FOR PARENTS AND GUARDIANS",
]

# Find the element_id of each chapter title listed above
chapter_ids = {}
for element in resp.elements:
    for chapter in chapters:
        if element["text"] == chapter and element["type"] == "Title":
            chapter_ids[element["element_id"]] = chapter
            break

# Swap keys and values so a chapter name can be looked up directly
chapter_to_id = {v: k for k, v in chapter_ids.items()}
# Collect every element whose parent_id is the ICE-HOCKEY title, then show the first one
[x for x in resp.elements if x["metadata"].get("parent_id") == chapter_to_id["ICE-HOCKEY"]][0]

Output:

{'type': 'NarrativeText',
 'element_id': 'c7c8e2f178cb0dc273ba7e811372640b',
 'text': 'Many of the Swiss winter-resorts can put\ninto the field a very strong ice-hockey team, and fine teams from other\ncountries often make winter tours there; but the ice-hockey which the\nordinary winter visitor will be apt to join in will probably be of the\nmost elementary and unscientific kind indulged in, when the skating day\nis drawing to a close, by picked-up sides. As will be readily\nunderstood, the ice over which a hockey match has been played is\nperfectly useless for skaters any more that day until it has been swept,\nscraped, and sprinkled or flooded; and in consequence, at all Swiss\nresorts, with the exception of St. Moritz, where there is a rink that\nhas been made for the hockey-player, or when an important match is being\nplayed, this sport is supplementary to such others as I have spoken of.\nNobody, that is, plays hockey and nothing else, since he cannot play\nhockey at all till the greedy skaters have finished with the ice.',
 'metadata': {'emphasized_text_contents': ['Many'],
  'emphasized_text_tags': ['span'],
  'languages': ['eng'],
  'parent_id': '6cf4a015e8c188360ea9f02a9802269b',
  'filename': 'winter-sports.epub',
  'filetype': 'application/epub'}}
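The lookup above only finds direct children of a chapter title. If an element sits under a sub-heading, its `parent_id` points at that sub-heading rather than at the chapter. The source does not show whether this book has such nesting, so the following is a sketch under that assumption. It walks the `parent_id` chain upward until it reaches a known chapter, and `chapter_for` and the toy data are invented for the example.

```python
def chapter_for(element, by_id, chapter_ids):
    """Follow parent_id links upward until a known chapter title is reached."""
    seen = set()  # guards against accidental cycles in parent links
    parent_id = element["metadata"].get("parent_id")
    while parent_id is not None and parent_id not in seen:
        if parent_id in chapter_ids:
            return chapter_ids[parent_id]
        seen.add(parent_id)
        parent = by_id.get(parent_id)
        if parent is None:
            break
        parent_id = parent["metadata"].get("parent_id")
    return ""  # no chapter found, same default as an empty chapter label


# Toy data with one level of nesting (hypothetical ids).
toy = [
    {"element_id": "c1", "type": "Title", "text": "ICE-HOCKEY", "metadata": {}},
    {"element_id": "s1", "type": "Title", "text": "Rules", "metadata": {"parent_id": "c1"}},
    {"element_id": "p1", "type": "NarrativeText", "text": "Rule one.",
     "metadata": {"parent_id": "s1"}},
    {"element_id": "p2", "type": "NarrativeText", "text": "Orphan.", "metadata": {}},
]
toy_chapter_ids = {"c1": "ICE-HOCKEY"}
by_id = {el["element_id"]: el for el in toy}
resolved = {el["element_id"]: chapter_for(el, by_id, toy_chapter_ids) for el in toy}
print(resolved)
```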

3.6 Persist the elements

client = chromadb.PersistentClient(path="chroma_tmp", settings=chromadb.Settings(allow_reset=True))
client.reset()  # returns True
collection = client.create_collection(
    name="winter_sports",
    metadata={"hnsw:space": "cosine"}
)
# Store each element's text in ChromaDB; this may take around five minutes
for element in resp.elements:
    parent_id = element["metadata"].get("parent_id")
    chapter = chapter_ids.get(parent_id, "")
    collection.add(
        documents=[element["text"]],
        ids=[element["element_id"]],
        metadatas=[{"chapter": chapter}]
    )
# Fetch a few stored records and inspect them
results = collection.peek()
print(results["documents"])

Output:

['[Image\nunavailable.]', '[Image\nunavailable.]', 'Here is a remarkably varied programme, and one that will obviously\ngive a good spell of regular work to a candidate who intends to grapple\nwith it. It contains more of the material for skating than does the\ncorresponding English second test, in which only the four edges, the\nfour simple turns, and the four changes of edge are introduced, since\nthis International second test comprises as well as those, the four\nloops, and two out of the four brackets. These\nloops, which are most charming and effective figures, have nowadays no\nplace in English skating, since it is quite impossible to execute any of\nthem, as far as is at present known, without breaking the rules for\nEnglish skating, since the unemployed leg (i.e. the one not\ntracing the figure) must be used to get the necessary balance and swing.\nThey belong to a great class of figures like cross-cuts in all their\nvarieties, beaks, pigs-ears, &c., in which the skater nearly, or\nactually, stops still for a moment, and then, by a swing of the body or\nleg, resumes or reverses his movement. By this momentary loss and\nrecovery of balance there is opened out to the skater whole new fields\nof intricate and delightful movements, and the patterns that can be\ntraced on the ice are of endless variety. And here in this second\nInternational test the confines of this territory are entered on by the\nfour loops, which are the simplest of the “check and recovery” figures.\nIn the loops (the shape of which is accurately expressed by their names)\nthe skater does not come absolutely to a standstill, though very nearly,\nand the swing of the body and leg is then thrown forward in front of the\nskate, and this restores to it its velocity, and pulls it, so to speak,\nout of its loop. 
A further extension of this check and resumption of\nspeed occurs in cross-cuts, which do not enter into the International\ntests, but which figure largely in the performance of good skaters. Here\nthe forward movement of the skate (or backward movement, if back\ncross-cuts are being skated) is entirely checked, the skater comes to a\nmomentary standstill and moves backwards for a second. Then the forward\nswing of the body and unemployed leg gives him back his checked and\nreversed movement.', '[Image\nunavailable.]', '(a) A set of combined figures skated with another skater,\nwho will be selected by the judges, introducing the following calls in\nsuch order and with such repetitions as the judges may direct:—', 'CHAPTER\nVII', 'The figures need not be commenced from rest.', 'But when we consider that the first-class skater must be able to\nskate at high speed on any edge, make any turn at a fixed point, and\nleave that fixed point (having made his turn and edge in compliance with\nthe proper form for English skating, without scrape or wavering) still\non a firm and large-circumferenced curve, that he must be able to\ncombine any mohawk and choctaw with any of the sixteen turns, and any of\nthe sixteen turns with any change of edge, and that in combined skating\nhe is frequently called upon to do all these permutations of edge and\nturn, at a fixed point, and in\ntime with his partner, while two other partners are performing the same\nevolution in time with each other, it begins to become obvious that\nthere is considerable variety to be obtained out of these manœuvres. 
But\nthe consideration of combined skating, which is the cream and\nquintessence of English skating, must be considered last; at present we\nwill see what the single skater may be called upon to do, if he wishes\nto attain to acknowledged excellence in his sport.', 'Plate XXXII', 'He delivers the stone: the skip, eagle-eyed, watches the pace of it.\nIt may seem to him to be travelling with sufficient speed to reach the\nspot at which he desires it should rest. In this case he says nothing\nwhatever, except probably “Well laid down.” Smoothly it glides, and in\nall probability he will exclaim “Not a touch”: or (if he is very Scotch,\neither by birth or by infection of curling) “not a cow” (which means not\na touch of the besom). On the other hand he may think that it has been\nlaid down too weakly and will not get over the hog-line. Then he will\nshriek out, “Sweep it; sweep it” (or “soop it; soop it”) “man” (or\n“mon”). On which No. 2 and No. 3 of his side burst into frenzied\nactivity, running by the side of the stone and polishing the surface of\nthe ice immediately in front of it with their besoms. For, however well\nthe ice has been prepared, this zealous polishing assists a stone to\ntravel, and vigorous sweeping of the ice in front of it will give, even\non very smooth and hard ice, several feet of additional travel, and a\nstone that would have been hopelessly hogged will easily be converted\ninto the most useful of stones by diligent sweeping, and will lie a\nlittle way in front of the house where the skip has probably directed it\nto be. If he is an astute and cunning old dog, as all skips should be,\nhe will not want this first stone in the house at all; in fact, if he\nsees it is coming into the house, he will probably say “too strong.”\nYet, since according to\nthe rules only stones inside the house can count for the score, it seems\nincredible at first sight why he should not want every stone to be\nthere. This “inwardness” will be explained later.']
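The loop in 3.6 calls `collection.add` once per element. Since `add` accepts lists, the records can also be grouped into batches first. The generator below is a sketch of that grouping only. `iter_batches` is a name invented here, the batch size is arbitrary, and whether batching actually shortens the five-minute load was not measured. It also skips repeated `element_id` values, on the assumption that duplicate ids within one batch are unwanted.

```python
def iter_batches(elements, chapter_ids, batch_size=100):
    """Yield dicts of keyword arguments suitable for collection.add(**batch)."""
    seen_ids = set()
    batch = {"documents": [], "ids": [], "metadatas": []}
    for el in elements:
        el_id = el["element_id"]
        if el_id in seen_ids:
            continue  # skip repeated ids (assumption: duplicates are unwanted)
        seen_ids.add(el_id)
        chapter = chapter_ids.get(el["metadata"].get("parent_id"), "")
        batch["documents"].append(el["text"])
        batch["ids"].append(el_id)
        batch["metadatas"].append({"chapter": chapter})
        if len(batch["ids"]) == batch_size:
            yield batch
            batch = {"documents": [], "ids": [], "metadatas": []}
    if batch["ids"]:
        yield batch  # final, possibly smaller, batch


# Toy run with hypothetical ids; "a" appears twice.
toy_elements = [
    {"element_id": "a", "text": "one", "metadata": {"parent_id": "t"}},
    {"element_id": "b", "text": "two", "metadata": {}},
    {"element_id": "a", "text": "one again", "metadata": {"parent_id": "t"}},
    {"element_id": "c", "text": "three", "metadata": {"parent_id": "t"}},
]
batches = list(iter_batches(toy_elements, {"t": "ICE-HOCKEY"}, batch_size=2))
print([b["ids"] for b in batches])

# With the real data:
# for batch in iter_batches(resp.elements, chapter_ids):
#     collection.add(**batch)
```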

3.7 Run a query

result = collection.query(
    query_texts=["How many players are on a team?"],
    n_results=2,
    where={"chapter": "ICE-HOCKEY"},
)
print(json.dumps(result, indent=2))

Output:

{
  "ids": [
    [
      "241221156e35865aa1715aa298bcc78d",
      "7a2340e355dc6059a061245db57f925b"
    ]
  ],
  "distances": [
    [
      0.5229756832122803,
      0.7836341261863708
    ]
  ],
  "metadatas": [
    [
      {
        "chapter": "ICE-HOCKEY"
      },
      {
        "chapter": "ICE-HOCKEY"
      }
    ]
  ],
  "embeddings": null,
  "documents": [
    [
      "It is a wonderful and delightful sight to watch the speed and\naccuracy of a first-rate team, each member of which knows the play of\nthe other five players. The finer the team, as is always the case, the\ngreater is their interdependence on each other, and the less there is of\nindividual play. Brilliant running and dribbling, indeed, you will see;\nbut as distinguished from a side composed of individuals, however good,\nwho are yet not a team, these brilliant episodes are always part of a\nplan, and end not in some wild shot but in a pass or a succession of\npasses, designed to lead to a good opening for scoring. There is,\nindeed, no game at which team play outwits individual brilliance so\ncompletely.",
      "And in most places hockey is not taken very seriously: it is a\ncharming and heat-producing scramble to take part in when the out-door\nday is drawing to a close and the chill of the evening beginning to set\nin; there is a vast quantity of falling down in its componence and not\nvery many goals, and a general ignorance about rules. But since a game,\nespecially such a wholly admirable\nand delightful game as ice-hockey, may just as well be played on the\nlines laid down for its conduct as not, I append at the end of this\nshort section a copy of the latest edition of the rules as issued by\nPrince\u2019s Club, London."
    ]
  ],
  "uris": null,
  "data": null,
  "included": [
    "metadatas",
    "documents",
    "distances"
  ]
}
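The query result nests one list per query text, which makes the raw JSON hard to scan. A small helper can flatten it into one row per hit. This is a sketch: `flatten_hits` is a made-up name, and `sample_result` is a shortened copy of the output above with the document texts truncated by hand.

```python
def flatten_hits(result, query_index=0):
    """Turn the nested query result into one dict per hit."""
    ids = result["ids"][query_index]
    docs = result["documents"][query_index]
    metas = result["metadatas"][query_index]
    dists = result["distances"][query_index]
    return [
        {
            "id": hit_id,
            "distance": dist,
            "chapter": meta.get("chapter", ""),
            "preview": doc[:60].replace("\n", " "),  # short single-line preview
        }
        for hit_id, doc, meta, dist in zip(ids, docs, metas, dists)
    ]


# Shortened copy of the output above (document texts truncated by hand).
sample_result = {
    "ids": [["241221156e35865aa1715aa298bcc78d", "7a2340e355dc6059a061245db57f925b"]],
    "distances": [[0.5229756832122803, 0.7836341261863708]],
    "metadatas": [[{"chapter": "ICE-HOCKEY"}, {"chapter": "ICE-HOCKEY"}]],
    "documents": [[
        "It is a wonderful and delightful sight to watch the speed and\naccuracy of a first-rate team",
        "And in most places hockey is not taken very seriously",
    ]],
}

hits = flatten_hits(sample_result)
for hit in hits:
    print(f"{hit['distance']:.3f}  {hit['chapter']}  {hit['preview']}")
```

Because the collection was created with cosine distance, a smaller distance means a closer match, so the first hit (the passage mentioning "the other five players") is the more relevant one.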

3.8 Chunk the document

elements = dict_to_elements(resp.elements)

chunks = chunk_by_title(
    elements,
    combine_text_under_n_chars=100, # combine sections shorter than 100 characters with adjacent ones
    max_characters=3000,
)

JSON(json.dumps(chunks[0].to_dict(), indent=2))

Output:

{
  "type": "CompositeElement",
  "element_id": "676ccd27-a9e4-46ea-80f1-00be45b60182",
  "text": "The Project Gutenberg eBook of Winter Sports in\nSwitzerland, by E. F. Benson\n\n\nThis ebook is for the use of anyone anywhere in the United States and\nmost other parts of the world at no cost and with almost no restrictions\nwhatsoever. You may copy it, give it away or re-use it under the terms\nof the Project Gutenberg License included with this ebook or online at\n\n\nwww.gutenberg.org. If you are not located\nin the United States, you’ll have to check the laws of the country where\nyou are located before using this eBook.",
  "metadata": {
    "emphasized_text_contents": [
      "Winter Sports in\nSwitzerland"
    ],
    "emphasized_text_tags": [
      "span"
    ],
    "filename": "winter-sports.epub",
    "filetype": "application/epub",
    "languages": [
      "eng"
    ],
    "link_texts": [
      "www.gutenberg.org",
      "www.gutenberg.org"
    ],
    "link_urls": [
      "https://www.gutenberg.org",
      "https://www.gutenberg.org"
    ],
    "orig_elements": "eJzNVNFum0AQ/JUVzw4BG9uQx0htVamqKsVVVYXIOrgFrsF36G4JIVH/vXvYTtPGitSHqn0y3p2dHXZGXD8G2OIONW2VDC4gWJWrRRwV62gRL5ZFhUm1SkQWrxOxFlWKRTCDYIckpCDB+MegFIS1seNWYkcNl2JG4K5rhFMPKLeE97QtjSbe4bh9HXxR/MfCVWcsOVA611eDoge0rdAyuDkxTqLej7pO6AlRqRa12KGXPEx8Z27iC7HrJ5EeQWM3IUTXtYqFKqPPj31eVveixj0x6jq4+c5lv8+PbBqET9Z8w5LgXc/iC7Q14KUxt2AqePUdZlCM8CaEtyFconZG+31HLRtFLQa86vfLZ1gWyVwm1Xy1mCdrkc2jJElRZmm6iBZp+uLyf+UGvqr07daRsKxMS7zft+fp8qnpj7SfGYYhrI/nCY19xtDbdo9piDp3cX5+GtsJ+wfpe25RrjeNcoCF94QfKmOB2LbeobdI6NFo9D9DgxbZpKn7WStCCVfEuXXclbneGUdguGmB1bCfPO2hg7GtBEGgDZQew2hgmxsQ7TTDdYuOrCr9WV2uh0aQM3iHNoSvpoedGHmyG0HRDGp1xyqYZeAqa7V45qVypecz77WziTvmOQh4GcAPquRE+Zcp217iQQ79vAPzGs335xen/JfgfRTWsv13uPH3OxHAdC1kKqMslsssSop4XZSLZFnEkYjLjH34xwE8i/+z/L3gC+F9BSPbzpScDYLW+K8jJ+xU9mYemvfzKM7aFhrB4SDOWYPl7QRuxfCUxNL0muwIU5Jzfdxx4IcCOfs++ErXhzD4D1X4iv03PwDRsgQb"
  }
}

As the output shows, the document was chunked successfully.

print(len(elements)) # prints 752
print(len(chunks))   # prints 255

The document contains 752 elements in total; after chunking they are combined into 255 chunks.
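To build intuition for why 752 elements collapse into 255 chunks, here is a deliberately simplified title-based chunker in plain Python. It is not the unstructured implementation of `chunk_by_title`. It only starts a new chunk at each Title or when the size cap would be exceeded. It does not merge small sections the way `combine_text_under_n_chars` does, and it does not split a single oversized element.

```python
def chunk_by_title_sketch(elements, max_characters=3000):
    """Group element texts into chunks that start at a Title or at the size cap."""
    chunks, current = [], []
    current_len = 0
    for el in elements:
        text = el["text"]
        starts_section = el["type"] == "Title"
        too_long = current_len + len(text) > max_characters
        # Close the running chunk before a new section or when it would overflow.
        if current and (starts_section or too_long):
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(text)
        current_len += len(text)
    if current:
        chunks.append("\n\n".join(current))
    return chunks


# Toy elements; the tiny cap makes the size rule visible.
toy_elements = [
    {"type": "Title", "text": "A"},
    {"type": "NarrativeText", "text": "aaaa"},
    {"type": "NarrativeText", "text": "bbbb"},
    {"type": "Title", "text": "B"},
    {"type": "NarrativeText", "text": "cccc"},
]
toy_chunks = chunk_by_title_sketch(toy_elements, max_characters=6)
print(toy_chunks)
```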

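4. Summary

This section looked at metadata, which matters a great deal for extracting data from documents and for splitting them into chunks. Keep in mind, though, that elements can be misclassified as Title during parsing, so the results need to be checked.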
