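While recently collecting domain-specific terminology as corpus data for a large model, I found that some websites bury their terms and definitions in multiple sibling <p> nodes under a single <div> node. With this structure, grabbing the text with .//text() turns into a real headache, because the crawler can no longer tell reliably which piece of text is a term and which is a definition. 😓
Don't worry, though! Here I'll walk through a simple but effective case that uses XPath to pull the different kinds of information out of sibling <p> nodes precisely. Let's take a look! 🕵️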
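To make the problem concrete, here is a minimal sketch on a made-up HTML fragment (the markup and the class name "glossary" are assumptions for illustration, not the real page). It shows that .//text() returns one flat list in which nothing marks where a term ends and a definition begins.

```python
from lxml import etree

# Hypothetical markup that mimics the structure described above:
# terms and definitions all live in sibling <p> nodes under one <div>.
SAMPLE = """
<div class="glossary">
  <p><strong>Basalt</strong></p>
  <p>Igneous rock</p>
  <p>Basalt is a basic igneous rock.</p>
  <p><strong>Backfill</strong></p>
  <p>Cavity filling</p>
</div>
"""

root = etree.HTML(SAMPLE)
div = root.xpath('//div[@class="glossary"]')[0]

# .//text() collects every text node under the div in document order;
# whitespace-only nodes between the tags are filtered out here.
flat = [t.strip() for t in div.xpath('.//text()') if t.strip()]
print(flat)
```

The printed list mixes terms and definitions with no structural hint left, which is exactly the ambiguity the rest of this post works around.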
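Target site: Mining from A to Z - BEUMER Group
On this page, the mining terms and their definitions all sit in sibling p nodes, but the terms are clearly set in bold, which means each term sits inside a <strong> node. Following this idea, we can loop over the nodes and check whether each one is a strong node to tell terms from definitions.
Without further ado, here is the code: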
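import json

import requests
from lxml import etree

# Fetch the glossary page (pass cookies/headers here if the site requires them).
response = requests.get('https://www.beumergroup.com/knowledge/mining/our-mining-glossary/')  # , cookies=cookies, headers=headers)
html = etree.HTML(response.text, etree.HTMLParser())
# All sibling <p> nodes that hold the terms and their definitions.
res = html.xpath('//div[@class="content content--textLeft"]/p')
data = {'title': None, "content": ''}
for p in res:
    tags = p.xpath('./*')
    # A <p> with no child elements holds plain definition text.
    if not tags and p.xpath('./text()'):
        data['content'] += p.xpath('./text()')[0] + '\n '
    for tag in tags:
        print("tag", tag.tag)
        if tag.tag == 'strong':
            # A new term starts: write out the previous record first.
            if data.get("title"):
                with open("beumergroup.jsonl", 'a', encoding='utf-8') as f:
                    f.write(json.dumps(data, ensure_ascii=False) + '\n')
                data = {'title': None, "content": ''}
            data['title'] = tag.xpath('./text()')[0]
            print(tag.xpath('./text()'))
        else:
            # Any other child element contributes to the definition.
            text = tag.xpath('./text()')
            if text:
                data['content'] += text[0] + '\n '
# Write out the last record, which the loop itself never flushes.
if data.get("title"):
    with open("beumergroup.jsonl", 'a', encoding='utf-8') as f:
        f.write(json.dumps(data, ensure_ascii=False) + '\n')
print(data)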
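This code uses res = html.xpath('//div[@class="content content--textLeft"]/p') to collect all the p nodes that hold the terms and definitions. It then loops over those p nodes and checks their child nodes: the text of a strong node goes into the title field, and any other text goes into the content field. Each time a new strong node shows up, the previous record is appended to beumergroup.jsonl. The output looks like this: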
{"title": "Amphibolites", "content": "Rock\n Amphibolite is a metamorphic rock. The main components are amphiboles (mostly hornblende) and plagioclase. Quartz, garnet, diopside, epidote and biotite may also be present. The chemical composition of the amphibolites is metabasic. They are obtained from basic magmatites such as gabbros, basalts, andesites and their tuffs or from marls and tuffites (para-amphibolites)\n \n "}
{"title": "Angle of repose (slope)", "content": "Bulk materials property\n For cohesion-free, free-flowing, grainy materials such as grain, granulated mineral fertilizer, limestone, pellets, coke, etc., the natural angle of repose – also known as the angle of slope – is the maximum angle at which individual surface particles stop sliding down. The natural angle of repose can be easily determined by allowing the bulk material to flow from the base area of a cylinder-shaped sampler at low tipping height onto a firm horizontal support.\n \n "}
{"title": "Anthracites", "content": "Bituminous coal\n Anthracite is a bituminous coal with a volatile content of less than ten percent. This high-grade form of coal is extremely hard. Anthracite forms from vegetable matter under high pressure and in the absence of air. This increases the carbon content. This lies at over 91.5 percent by weight. Anthracite is particularly prized as a fuel due to its high energy content, the hot flame it produces and its combustion properties which mean that it leaves almost no residue.\n "}
{"title": "Backfill", "content": "Cavity filling\n In mining, the term backfilling refers to the filling-in of the cavity between the excavated area and the rock mass using suitable materials. Various materials (such as gravel) and a number of different technologies are used for backfilling.\n \n "}
{"title": "Ball mill", "content": "Grinding of ore\n A ball mill is a mill for ultrafine grinding or homogenization. It consists of a rotating grinding jar (steel cylinder) into which the material to be ground (ore) is fed. Grinding balls made of various materials are added – in the case of ore, the balls are made of steel. Rotation of the ball mill pulverizes the ore.\n \n "}
{"title": "Banded iron formation", "content": "Sedimentary rock\n Banded iron formations are units of iron-bearing marine sedimentary rock that were primarily deposited in the Precambrian period. They possess a characteristic banded structure due to their metal-bearing layers. The layers primarily consist of ferrous minerals with a vertical cross-section reminiscent of bands. This is what gives them their name.\n \n "}
{"title": "Basalt", "content": "Igneous rock\n Basalt is a basic igneous rock. It consists in particular of iron and magnesium silicates with pyroxenes and calcium-rich feldspar (plagioclase) and also usually also olivine. Basalt is the volcanic equivalent of gabbro (plutonite), which has the same chemical composition. Basalt is usually dark gray to black. Because the rock is formed through the action of volcanic processes, the groundmass is usually fine-grained due to the rapid cooling it undergoes.\n \n "}
{"title": "Base metal", "content": "Base metal\n Base metals are characterized by the fact that they oxidize – under normal conditions they react with oxygen in the air. For example, iron rusts. Zink and aluminum protect themselves through passivation, which is the formation of a corrosion-resistant oxide layer that prevents further oxidation. Base metals differ chemically from noble metals in that their redox pairs have a negative instead of a positive standard electrode potential (relative to the standard hydrogen electrode).\n \n "}
{"title": "Bauxites", "content": "Aluminum ore\n Bauxite is an aluminum ore that consists primarily of the aluminum materials gibbsite (hydrargillite) and diaspore as well as the iron oxides haematite and goethite, the clay material kaolinite and small amounts of the titanium oxide anatase. Bauxite takes its name from its place of discovery, Les Baux-de-Provence in the south of France. It was unearthed there for the first time in 1821.\n \n "}
{"title": "Belt conveyor", "content": "A system for moving bulk material\n Machine which transports bulk material on a conveyor belt. It mainly consists of a supporting structure made of steel sections, a drive station, a return station, idlers and a conveyor belt.\n \n "}
{"title": "Belt wagons", "content": "Mining technology\n \n "}
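As a variation on the loop above, the same idea can be wrapped in a function and tried offline. This is only a sketch under assumptions: the function name parse_glossary and the sample markup are made up for illustration, and it assumes every term sits alone in its own p node. It uses the XPath string() function, which returns an empty string when a p node has no strong child.

```python
from lxml import etree

# Hypothetical fragment that reuses the class name from the XPath above.
SAMPLE = """
<div class="content content--textLeft">
  <p><strong>Basalt</strong></p>
  <p>Igneous rock</p>
  <p>Basalt is a basic igneous rock.</p>
  <p><strong>Backfill</strong></p>
  <p>Cavity filling</p>
</div>
"""


def parse_glossary(html_text, container='//div[@class="content content--textLeft"]'):
    """Group sibling <p> nodes into {'title', 'content'} records."""
    root = etree.HTML(html_text)
    records = []
    current = None
    for p in root.xpath(container + '/p'):
        # string(./strong) is '' when the <p> has no <strong> child.
        term = p.xpath('string(./strong)').strip()
        if term:
            # A <p> with a <strong> child starts a new record.
            if current is not None:
                records.append(current)
            current = {'title': term, 'content': []}
        elif current is not None:
            # string(.) joins all descendant text; split/join collapses whitespace.
            text = ' '.join(p.xpath('string(.)').split())
            if text:
                current['content'].append(text)
    # Keep the final record, which has no following term to trigger a flush.
    if current is not None:
        records.append(current)
    return [{'title': r['title'], 'content': '\n'.join(r['content'])} for r in records]


records = parse_glossary(SAMPLE)
for record in records:
    print(record)
```

Returning a list instead of writing inside the loop keeps parsing separate from file output, so the records can be checked before they are dumped to a .jsonl file.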
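When collecting corpus data for large models, keeping the data high quality is critical. So when crawling, we need to apply basic XPath knowledge flexibly: transform nodes, extract the useful information accurately, strip ads and abnormal characters, and convert image and video links into a format the model can recognize. These steps keep the collected data clean and useful for large model training.
This post shared a simple example of working with XPath nodes. In later posts we will dig into more advanced XPath usage, covering node transformation, image and video link recognition, node merging, and other more complex and powerful operations! 🌐✨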
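For the "abnormal characters" part, here is one possible clean-up step for the content field. It is a sketch, not a fixed recipe: the helper name clean_text is made up, and using NFKC normalization is an assumption that may be too aggressive for some corpora (for example, it turns a superscript ² into a plain 2). Ad removal and link conversion are not covered here.

```python
import unicodedata


def clean_text(text):
    """Normalize a scraped string: fold odd characters and tidy whitespace."""
    # NFKC folds compatibility characters, e.g. a non-breaking space
    # becomes a normal space and full-width letters become ASCII.
    text = unicodedata.normalize('NFKC', text)
    # Drop control and format characters (Unicode categories starting
    # with "C", such as zero-width spaces), but keep newlines and tabs.
    text = ''.join(
        ch for ch in text
        if ch in '\n\t' or not unicodedata.category(ch).startswith('C')
    )
    # Collapse runs of whitespace inside each line and drop empty lines.
    lines = [' '.join(line.split()) for line in text.split('\n')]
    return '\n'.join(line for line in lines if line)


raw = "Rock\n Amphibolite is a metamorphic rock.\n \n "
print(repr(clean_text(raw)))
```

Applied to the content values shown earlier, this removes the trailing "\n \n " runs and the leading space on each line.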