How to collect domain-terminology corpora for large models! Advanced XPath tricks for the hunt

news2024/11/27 6:17:35

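I was recently collecting domain-specific terminology for a large-model corpus and found that some sites bury the terms and their explanations in several sibling <p> nodes under a single <div> node. With that structure, grabbing the text with .//text() becomes a real headache for a crawler, because the result no longer tells you which string is a term and which is an explanation. 😓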
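To make the problem concrete, here is a minimal sketch. The HTML fragment and the class name "glossary" are made up for illustration and are not taken from any real site. It shows that .//text() returns one flat list, and that lxml's text results still remember their parent node, which is the hook the rest of this post relies on.

```python
from lxml import etree

# Hypothetical fragment: terms are bold, explanations are plain sibling <p> nodes.
SAMPLE = (
    '<div class="glossary">'
    '<p><strong>Backfill</strong></p><p>Cavity filling</p>'
    '<p><strong>Ball mill</strong></p><p>Grinding of ore</p>'
    '</div>'
)

root = etree.HTML(SAMPLE)

# All text under the div, flattened: terms and explanations look the same.
flat = root.xpath('//div[@class="glossary"]//text()')
print(flat)

# Each text result knows its parent element, so the tag name tells them apart.
labelled = [(t.getparent().tag, str(t)) for t in flat]
print(labelled)
```

In the flat list nothing marks "Backfill" as a term, while the labelled list pairs it with the strong tag.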

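Don't worry, though! Here I'll walk through a simple but effective case that uses advanced XPath techniques to pull the different pieces of information out of sibling <p> nodes precisely. Let's take a look! 🕵️‍♂️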

Target site: Mining from A to Z - BEUMER Group

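On the target page, the mining terms and their explanations all sit in sibling p nodes, but the terms are clearly set in bold, which means each term lives inside a <strong> node. Following that idea, we can walk through the nodes and check whether each one is a strong node to tell terms from explanations.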

Enough talk, here's the code:

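import json

import requests
from lxml import etree

url = 'https://www.beumergroup.com/knowledge/mining/our-mining-glossary/'
response = requests.get(url)  # add cookies=cookies, headers=headers if the site needs them
html = etree.HTML(response.text)
# Every <p> directly under the content <div>; terms and explanations are siblings.
res = html.xpath('//div[@class="content content--textLeft"]/p')


def save(record):
    # Append one finished term/explanation record as a JSON line.
    with open('beumergroup.jsonl', 'a', encoding='utf-8') as f:
        f.write(json.dumps(record, ensure_ascii=False) + '\n')


data = {'title': None, 'content': ''}
for p in res:
    tags = p.xpath('./*')
    # A <p> without child elements holds plain explanation text.
    if not tags and p.xpath('./text()'):
        data['content'] += p.xpath('./text()')[0] + '\n '
    for tag in tags:
        if tag.tag == 'strong':
            # A <strong> starts a new term, so write out the previous record first.
            if data.get('title'):
                save(data)
            title = tag.xpath('./text()')
            data = {'title': title[0] if title else None, 'content': ''}
        else:
            text = tag.xpath('./text()')
            if text:
                data['content'] += text[0] + '\n '

# Records are only written when the next term shows up, so flush the last one here.
if data.get('title'):
    save(data)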

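The code first collects every p node that holds a term or an explanation with res = html.xpath('//div[@class="content content--textLeft"]/p'). It then walks through those p nodes and checks each one's child nodes: the text of a strong child goes into the title field and starts a new record, and anything else goes into the content field. The final output looks like this: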

{"title": "Amphibolites", "content": "Rock\n Amphibolite is a metamorphic rock. The main components are amphiboles (mostly hornblende) and plagioclase. Quartz, garnet, diopside, epidote and biotite may also be present. The chemical composition of the amphibolites is metabasic. They are obtained from basic magmatites such as gabbros, basalts, andesites and their tuffs or from marls and tuffites (para-amphibolites)\n  \n "}
{"title": "Angle of repose (slope)", "content": "Bulk materials property\n For cohesion-free, free-flowing, grainy materials such as grain, granulated mineral fertilizer, limestone, pellets, coke, etc., the natural angle of repose – also known as the angle of slope – is the maximum angle at which individual surface particles stop sliding down. The natural angle of repose can be easily determined by allowing the bulk material to flow from the base area of a cylinder-shaped sampler at low tipping height onto a firm horizontal support.\n  \n "}
{"title": "Anthracites", "content": "Bituminous coal\n Anthracite is a bituminous coal with a volatile content of less than ten percent. This high-grade form of coal is extremely hard. Anthracite forms from vegetable matter under high pressure and in the absence of air. This increases the carbon content. This lies at over 91.5 percent by weight. Anthracite is particularly prized as a fuel due to its high energy content, the hot flame it produces and its combustion properties which mean that it leaves almost no residue.\n "}
{"title": "Backfill", "content": "Cavity filling\n In mining, the term backfilling refers to the filling-in of the cavity between the excavated area and the rock mass using suitable materials. Various materials (such as gravel) and a number of different technologies are used for backfilling.\n  \n "}
{"title": "Ball mill", "content": "Grinding of ore\n A ball mill is a mill for ultrafine grinding or homogenization. It consists of a rotating grinding jar (steel cylinder) into which the material to be ground (ore) is fed. Grinding balls made of various materials are added – in the case of ore, the balls are made of steel. Rotation of the ball mill pulverizes the ore.\n  \n "}
{"title": "Banded iron formation", "content": "Sedimentary rock\n Banded iron formations are units of iron-bearing marine sedimentary rock that were primarily deposited in the Precambrian period. They possess a characteristic banded structure due to their metal-bearing layers. The layers primarily consist of ferrous minerals with a vertical cross-section reminiscent of bands. This is what gives them their name.\n  \n "}
{"title": "Basalt", "content": "Igneous rock\n Basalt is a basic igneous rock. It consists in particular of iron and magnesium silicates with pyroxenes and calcium-rich feldspar (plagioclase) and also usually also olivine. Basalt is the volcanic equivalent of gabbro (plutonite), which has the same chemical composition. Basalt is usually dark gray to black. Because the rock is formed through the action of volcanic processes, the groundmass is usually fine-grained due to the rapid cooling it undergoes.\n  \n "}
{"title": "Base metal", "content": "Base metal\n Base metals are characterized by the fact that they oxidize – under normal conditions they react with oxygen in the air. For example, iron rusts. Zink and aluminum protect themselves through passivation, which is the formation of a corrosion-resistant oxide layer that prevents further oxidation. Base metals differ chemically from noble metals in that their redox pairs have a negative instead of a positive standard electrode potential (relative to the standard hydrogen electrode).\n  \n "}
{"title": "Bauxites", "content": "Aluminum ore\n Bauxite is an aluminum ore that consists primarily of the aluminum materials gibbsite (hydrargillite) and diaspore as well as the iron oxides haematite and goethite, the clay material kaolinite and small amounts of the titanium oxide anatase. Bauxite takes its name from its place of discovery, Les Baux-de-Provence in the south of France. It was unearthed there for the first time in 1821.\n  \n "}
{"title": "Belt conveyor", "content": "A system for moving bulk material\n Machine which transports bulk material on a conveyor belt. It mainly consists of a supporting structure made of steel sections, a drive station, a return station, idlers and a conveyor belt.\n  \n "}
{"title": "Belt wagons", "content": "Mining technology\n  \n "}
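The same split can also be written so that XPath functions do more of the work. The following is only a sketch under assumptions: SAMPLE is a made-up fragment that imitates the structure described above (the real page may differ), and parse_glossary is a hypothetical helper name. It uses normalize-space(./strong) to detect a term and normalize-space(.) to take all the text of an explanation node, including text inside nested tags such as em.

```python
import json

from lxml import etree

# Hypothetical fragment modeled on the structure described in this post.
SAMPLE = """
<div class="content content--textLeft">
  <p><strong>Backfill</strong></p>
  <p>Cavity filling</p>
  <p>In mining, backfilling refers to the <em>filling-in</em> of a cavity.</p>
  <p><strong>Ball mill</strong></p>
  <p>Grinding of ore</p>
</div>
"""


def parse_glossary(html_text):
    """Group sibling <p> nodes into {'title', 'content'} records."""
    root = etree.HTML(html_text)
    records = []
    for p in root.xpath('//div[@class="content content--textLeft"]/p'):
        # Empty string when the <p> has no <strong> child.
        term = p.xpath('normalize-space(./strong)')
        if term:
            records.append({'title': term, 'lines': []})
        elif records:
            # Full text of the node, nested tags included, whitespace collapsed.
            text = p.xpath('normalize-space(.)')
            if text:
                records[-1]['lines'].append(text)
    return [{'title': r['title'], 'content': '\n'.join(r['lines'])} for r in records]


records = parse_glossary(SAMPLE)
for record in records:
    print(json.dumps(record, ensure_ascii=False))
```

Because normalize-space collapses whitespace, the content produced here will not match the output shown above character for character. A p node that mixes a strong term with trailing text is treated as a term only, so that case would need extra handling.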

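When collecting corpora for large models, data quality is critical. So when scraping, we need to apply XPath fundamentals flexibly: transform nodes, extract the useful information accurately, strip ads and abnormal characters, and convert image and video links into a format the model can recognize. This keeps the collected data clean and useful, which is what large-model training depends on.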
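As one small illustration of the "strip abnormal characters" step, here is a sketch that uses only the Python standard library. The function name clean_text is hypothetical, and the guess that the stray blank lines in the output above come from non-breaking spaces is an assumption, not something verified against the page.

```python
import re
import unicodedata


def clean_text(raw):
    """Normalize whitespace and drop invisible characters, line by line."""
    # NFKC folds compatibility characters, e.g. a non-breaking space becomes a space.
    text = unicodedata.normalize('NFKC', raw)
    cleaned = []
    for line in text.split('\n'):
        # Collapse runs of whitespace (tabs, carriage returns, spaces) to one space.
        line = re.sub(r'\s+', ' ', line)
        # Remove control (Cc) and format (Cf) characters such as zero-width spaces.
        line = ''.join(ch for ch in line if unicodedata.category(ch) not in ('Cc', 'Cf'))
        line = line.strip()
        if line:
            cleaned.append(line)
    return '\n'.join(cleaned)


print(repr(clean_text('Rock\n Amphibolite is a metamorphic rock.\n \xa0\n ')))
```

Note that NFKC also rewrites characters such as full-width letters and superscripts, so a corpus where those distinctions matter may want a narrower normalization.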

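This post shared a simple example of working with XPath nodes. In later posts we'll dig into more advanced XPath uses, covering node transformation, image and video link detection, node merging, and other more complex and powerful operations! 🌐✨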

