Contents
I. Python Web Scraping: Parsing with XPath
XPath basics:
XPath syntax:
XPath axes:
XPath predicates:
XPath functions:
Common XPath expressions for selecting elements and attributes
A sample HTML document
Summary
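Hello, everyone. I'm 喔的嘛呀. Today we'll learn how to parse pages with XPath in a Python scraper. Ready?
I. Python Web Scraping: Parsing with XPath
XPath (XML Path Language) is a query language for navigating XML or HTML documents and selecting elements in them. In web scraping it is commonly used to locate specific elements on a page. In Python you can use the **lxml** library, which provides an XPath parser and evaluator.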
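As a minimal sketch of what that looks like in practice, the snippet below parses a small HTML string with lxml and runs two XPath queries against it. The markup and variable names are made up for illustration only.

```python
from lxml import etree

# Hypothetical markup, used only to illustrate the API.
html_text = "<html><body><h1>Hello</h1><p>First</p><p>Second</p></body></html>"

# etree.HTML parses an HTML string and returns the root <html> element.
root = etree.HTML(html_text)

# For node-set expressions, xpath() returns a list (empty if nothing matches).
paragraphs = root.xpath("//p/text()")
heading = root.xpath("//h1/text()")

print(paragraphs)  # ['First', 'Second']
print(heading)     # ['Hello']
```

Because the result is always a list for node-set queries, code that wants a single value has to index into it and handle the empty case.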
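XPath basics:
- XPath expressions navigate the elements and attributes of an XML or HTML document.
- XPath uses path expressions to select a node or a set of nodes in the document.
- Nodes can be selected by name, by attribute, or by position in the document tree.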
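XPath syntax:
- An XPath expression is written as a string and passed to the **xpath()** method of an **lxml** element.
- The basic form of an expression is **/path/to/element**.
- A leading **/** denotes the document root.
- Element names are separated by **/**, which expresses the hierarchy. A double slash, **//**, matches nodes at any depth.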
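The difference between an absolute path, a `//` search, and a path relative to an element can be sketched as follows. The markup here is hypothetical and chosen so that the three queries return different results.

```python
from lxml import etree

# Hypothetical markup: one <p> nested in a <div>, one directly under <body>.
root = etree.HTML(
    "<html><body><div><p>inside div</p></div><p>direct child</p></body></html>"
)

# Absolute path: every step goes exactly one level down from the root.
direct = root.xpath("/html/body/p/text()")

# '//' matches <p> elements at any depth in the document.
anywhere = root.xpath("//p/text()")

# A leading '.' makes the path relative to the element xpath() is called on.
div = root.xpath("/html/body/div")[0]
relative = div.xpath("./p/text()")

print(direct)    # ['direct child']
print(anywhere)  # ['inside div', 'direct child']
print(relative)  # ['inside div']
```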
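XPath axes:
- An axis defines which nodes are considered relative to the current (context) node in an XPath expression.
- Common axes include **child::**, **parent::**, **following-sibling::**, **preceding-sibling::**, **ancestor::**, and **descendant::**.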
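A short sketch of these axes, evaluated from a single context node. The list markup and the `id` value are assumptions made for the example.

```python
from lxml import etree

# Hypothetical markup; the middle <li> is used as the context node.
root = etree.HTML(
    "<html><body><ul><li>a</li><li id='mid'>b</li><li>c</li></ul></body></html>"
)
mid = root.xpath("//li[@id='mid']")[0]

# Siblings after and before the context node.
following = mid.xpath("following-sibling::li/text()")
preceding = mid.xpath("preceding-sibling::li/text()")

# The parent element and all ancestors of the context node.
parent_tag = mid.xpath("parent::*")[0].tag
ancestor_tags = {el.tag for el in mid.xpath("ancestor::*")}

# child:: is the default axis, so //ul/child::li is the same as //ul/li.
children = root.xpath("//ul/child::li/text()")

# descendant:: walks every level below the context node.
descendant_tags = [el.tag for el in root.xpath("//body/descendant::*")]

print(following, preceding)  # ['c'] ['a']
print(parent_tag)            # ul
print(children)              # ['a', 'b', 'c']
print(descendant_tags)       # ['ul', 'li', 'li', 'li']
```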
XPath predicates:
- Predicates filter nodes by a condition.
- They are enclosed in square brackets **[]** and can contain conditions such as **@attribute='value'** or **position()=1**.
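The following sketch shows attribute and position predicates on a small, made-up list. Note that positions are 1-based and that chained predicates are applied left to right.

```python
from lxml import etree

# Hypothetical markup: three items, two of which carry class="x".
root = etree.HTML("<ul><li class='x'>a</li><li>b</li><li class='x'>c</li></ul>")

# Filter by attribute value.
with_class = root.xpath("//li[@class='x']/text()")

# Filter by position; [1] is shorthand for [position()=1].
first = root.xpath("//li[position()=1]/text()")
last = root.xpath("//li[last()]/text()")

# Chained predicates: first keep class="x" items, then take the second of those.
second_with_class = root.xpath("//li[@class='x'][2]/text()")

print(with_class)         # ['a', 'c']
print(first, last)        # ['a'] ['c']
print(second_with_class)  # ['c']
```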
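XPath functions:
- XPath provides functions for working with strings, numbers, and other data types.
- Examples include **contains()**, **starts-with()**, **concat()**, **last()**, **position()**, and **count()**. **text()** is written like a function, but strictly speaking it is a node test that selects text nodes.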
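A brief sketch of several of these functions with lxml. The links below are invented for the example. Functions that return a number or string make `xpath()` return a Python float or string instead of a list.

```python
from lxml import etree

# Hypothetical markup with three links.
root = etree.HTML(
    "<div>"
    "<a href='https://example.com/a'>Alpha news</a>"
    "<a href='/local'>Beta</a>"
    "<a href='https://example.com/c'>Gamma news</a>"
    "</div>"
)

# contains(): links whose text includes the word 'news'.
news_links = root.xpath("//a[contains(text(), 'news')]/text()")

# starts-with(): links whose href begins with 'https://'.
absolute_hrefs = root.xpath("//a[starts-with(@href, 'https://')]/@href")

# count() returns an XPath number, which lxml exposes as a float.
link_count = root.xpath("count(//a)")

# concat() returns a string.
label = root.xpath("concat('total: ', count(//a))")

# last() inside a predicate selects the final matching node.
last_link = root.xpath("//a[last()]/text()")

print(news_links)      # ['Alpha news', 'Gamma news']
print(absolute_hrefs)  # ['https://example.com/a', 'https://example.com/c']
print(link_count)      # 3.0
print(label)           # total: 3
print(last_link)       # ['Gamma news']
```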
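When parsing HTML or XML documents with XPath, a few common expressions cover most element and attribute selections. Some common examples:
- Select all elements:
  - //* selects every element in the document.
- Select specific elements:
  - //tagname selects every element with the given name.
  - //tagname[@attribute='value'] selects the elements whose attribute has the given value.
- Select child elements:
  - //parent/child selects the child elements directly under the given parent.
- Select the parent element:
  - //child/.. selects the element's parent.
- Select sibling elements:
  - //element/following-sibling::sibling selects the siblings that come after the element.
  - //element/preceding-sibling::sibling selects the siblings that come before the element.
- Select attributes:
  - //@attribute selects every attribute with the given name (the attribute nodes themselves, not the elements that carry them).
- Select text content and attribute values:
  - //element/text() selects the element's text content.
  - //element/@attribute selects the value of a specific attribute of the element.
- Use wildcards:
  - //element[*] selects elements that have at least one child element.
  - //element[@*] selects elements that have at least one attribute.
- Use logical operators:
  - //element[@attribute='value' and @attribute2='value2'] selects elements that satisfy both attribute conditions.
These examples show some common uses of XPath syntax; adapt the expressions to your own situation as needed.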
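To make a few of these patterns concrete, here is a small sketch that runs attribute, parent, wildcard, and logical-operator selections with lxml. The markup is hypothetical. Note that in lxml an attribute query such as `//@src` returns the attribute values as strings.

```python
from lxml import etree

# Hypothetical markup for illustration.
root = etree.HTML(
    "<div id='box' class='card'>"
    "<span>one</span>"
    "<img src='a.png' alt='A'/>"
    "<img src='b.png'/>"
    "</div>"
    "<p>plain</p>"
)

# //@attribute yields attribute values, not the owning elements.
sources = root.xpath("//@src")

# '..' steps up to the parent element.
span_parent = root.xpath("//span/..")[0].tag

# [@*] keeps elements that have at least one attribute.
tags_with_attrs = [el.tag for el in root.xpath("//*[@*]")]

# [*] keeps elements that have at least one child element.
divs_with_children = [el.tag for el in root.xpath("//div[*]")]

# 'and' combines two conditions inside one predicate.
alt_text = root.xpath("//img[@src='a.png' and @alt='A']/@alt")

print(sources)             # ['a.png', 'b.png']
print(span_parent)         # div
print(tags_with_attrs)     # ['div', 'img', 'img']
print(divs_with_children)  # ['div']
print(alt_text)            # ['A']
```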
Suppose we have the following HTML document as a sample:
<html>
  <head>
    <title>Example</title>
  </head>
  <body>
    <div class="content">
      <h1>Heading</h1>
      <p>Paragraph 1</p>
      <p>Paragraph 2</p>
      <a href="https://example.com">Link</a>
    </div>
    <div class="sidebar">
      <h2>Sidebar heading</h2>
      <ul>
        <li>Item 1</li>
        <li>Item 2</li>
        <li>Item 3</li>
      </ul>
    </div>
  </body>
</html>
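We can use XPath to select and extract these elements. Some basic examples:
1. Select all paragraph elements (<p>):
//p
2. Select paragraph elements that have a specific class attribute. In this sample no <p> carries a class, so the expression matches nothing; to get the paragraphs inside the content block, use //div[@class='content']/p instead:
//p[@class='content']
3. Select the text of all link elements (<a>):
//a/text()
4. Select the text of all sidebar items (<li>):
//div[@class='sidebar']//li/text()
5. Select the text of all heading elements (<h1> and <h2>):
//h1/text() | //h2/text()
6. Select the text of the first paragraph element (<p>). Strictly, //p[1] matches every <p> that is the first <p> child of its parent; here only one parent contains paragraphs, so a single node is returned:
//p[1]/text()
These examples show how to select and extract different elements and content from an HTML document with XPath. In real projects you can build more complex expressions to suit your data-extraction needs.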
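The expressions above can be checked with a short script. This is a sketch that embeds an inline copy of the sample document (with English placeholder text) so that it runs without network access.

```python
from lxml import etree

# Inline copy of the sample document, with English placeholder text.
SAMPLE_HTML = """
<html>
  <head><title>Example</title></head>
  <body>
    <div class="content">
      <h1>Heading</h1>
      <p>Paragraph 1</p>
      <p>Paragraph 2</p>
      <a href="https://example.com">Link</a>
    </div>
    <div class="sidebar">
      <h2>Sidebar heading</h2>
      <ul>
        <li>Item 1</li>
        <li>Item 2</li>
        <li>Item 3</li>
      </ul>
    </div>
  </body>
</html>
"""

tree = etree.HTML(SAMPLE_HTML)

paragraphs = tree.xpath("//p")                       # two <p> elements
classed = tree.xpath("//p[@class='content']")        # no <p> has a class here
content_paragraphs = tree.xpath("//div[@class='content']/p/text()")
link_text = tree.xpath("//a/text()")
items = tree.xpath("//div[@class='sidebar']//li/text()")
headings = tree.xpath("//h1/text() | //h2/text()")
first_paragraph = tree.xpath("//p[1]/text()")

print(len(paragraphs))     # 2
print(classed)             # []
print(content_paragraphs)  # ['Paragraph 1', 'Paragraph 2']
print(link_text)           # ['Link']
print(items)               # ['Item 1', 'Item 2', 'Item 3']
print(headings)            # ['Heading', 'Sidebar heading']
print(first_paragraph)     # ['Paragraph 1']
```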
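Case studies:
1. The following Python example uses XPath to fetch movie titles and ratings from the Taopiaopiao (淘票票) site.
First, look at the page structure:
<div class="center-wrap" data-spm="w2">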
<div class="tab-control tab-movie-tit">
<a class="tab-control-item current" href="#">正在热映(65)</a>
<a class="tab-control-item" href="#">即将上映(106)</a>
<a class="more" href="https://dianying.taobao.com/showList.htm?n_s=new">查看全部 ></a>
</div>
<div class="tab-content">
<!-- 正在热映 -->
<div class="tab-movie-list" style="display: block;">
<div class="movie-card-wrap">
<a href="https://dianying.taobao.com/showDetail.htm?showId=513401&n_s=new&source=current" class="movie-card">
<div class="movie-card-tag"><i class="t-201"></i></div>
<div class="movie-card-poster">
<img width="160" height="224" data-src="https://img.alicdn.com/bao/uploaded/i1/O1CN01WBx9mv1dKtVdfuoZN_!!6000000003718-0-alipicbeacon.jpg_160x240.jpg" src="https://img.alicdn.com/bao/uploaded/i1/O1CN01WBx9mv1dKtVdfuoZN_!!6000000003718-0-alipicbeacon.jpg_160x240.jpg">
</div>
<div class="movie-card-name">
<span class="bt-l">功夫熊猫4</span>
<span class="bt-r">9.1</span>
</div>
<div class="movie-card-info">
<div class="movie-card-mask"></div>
<div class="movie-card-list">
<span>导演:迈克·米切尔, 斯蒂芬妮·斯汀</span>
<span>主演:杰克·布莱克,黄渤,奥卡菲娜,杨幂,维奥拉·戴维斯,蒋欣</span>
<span>类型:动画,动作,冒险</span>
<span>地区:美国</span>
<span>语言:英语</span>
<span>片长:94分钟</span> </div>
</div>
</a>
<a href="https://dianying.taobao.com/showDetail.htm?showId=513401&n_s=new" class="movie-card-buy">选座购票</a>
</div>
<div class="movie-card-wrap">
<a href="https://dianying.taobao.com/showDetail.htm?showId=1478900&n_s=new&source=current" class="movie-card">
<div class="movie-card-tag"><i class="t-"></i></div>
<div class="movie-card-poster">
<img width="160" height="224" data-src="https://img.alicdn.com/bao/uploaded/i3/O1CN01JfQQxY1xDNJakaXHZ_!!6000000006409-0-alipicbeacon.jpg_160x240.jpg" src="https://img.alicdn.com/bao/uploaded/i3/O1CN01JfQQxY1xDNJakaXHZ_!!6000000006409-0-alipicbeacon.jpg_160x240.jpg">
</div>
<div class="movie-card-name">
<span class="bt-l">周处除三害</span>
<span class="bt-r">9.5</span>
</div>
<div class="movie-card-info">
<div class="movie-card-mask"></div>
<div class="movie-card-list">
<span>导演:黄精甫</span>
<span>主演:阮经天,袁富华,陈以文,王净,李李仁,谢琼煖</span>
<span>类型:动作,犯罪,悬疑</span>
<span>地区:中国台湾</span>
<span>语言:汉语普通话</span>
<span>片长:134分钟</span> </div>
</div>
</a>
<a href="https://dianying.taobao.com/showDetail.htm?showId=1478900&n_s=new" class="movie-card-buy">选座购票</a>
</div>
<div class="movie-card-wrap">
<a href="https://dianying.taobao.com/showDetail.htm?showId=1409686&n_s=new&source=current" class="movie-card">
<div class="movie-card-tag"><i class="t-203"></i></div>
<div class="movie-card-poster">
<img width="160" height="224" data-src="https://img.alicdn.com/bao/uploaded/i4/O1CN01aT7ZNc1gJvVjTxn8k_!!6000000004122-0-alipicbeacon.jpg_160x240.jpg" src="https://img.alicdn.com/bao/uploaded/i4/O1CN01aT7ZNc1gJvVjTxn8k_!!6000000004122-0-alipicbeacon.jpg_160x240.jpg">
</div>
<div class="movie-card-name">
<span class="bt-l">沙丘2</span>
<span class="bt-r">9.3</span>
</div>
<div class="movie-card-info">
<div class="movie-card-mask"></div>
<div class="movie-card-list">
<span>导演:丹尼斯·维伦纽瓦</span>
<span>主演:提莫西·查拉梅,赞达亚,丽贝卡·弗格森,乔什·布洛林,奥斯汀·巴特勒,弗洛伦斯·皮尤,戴夫·巴蒂斯塔,克里斯托弗·沃肯,蕾雅·赛杜,斯特兰·斯卡斯加德,夏洛特·兰普林,哈维尔·巴登</span>
<span>类型:科幻,动作,冒险,剧情</span>
<span>地区:美国</span>
<span>语言:英语</span>
<span>片长:166分钟</span> </div>
</div>
</a>
<a href="https://dianying.taobao.com/showDetail.htm?showId=1409686&n_s=new" class="movie-card-buy">选座购票</a>
</div>
<div class="movie-card-wrap">
<a href="https://dianying.taobao.com/showDetail.htm?showId=1429194&n_s=new&source=current" class="movie-card">
<div class="movie-card-tag"><i class="t-103"></i></div>
<div class="movie-card-poster">
<img width="160" height="224" data-src="https://img.alicdn.com/bao/uploaded/i3/O1CN01s4djbH29FutyK4fzY_!!6000000008039-0-alipicbeacon.jpg_160x240.jpg" src="https://img.alicdn.com/bao/uploaded/i3/O1CN01s4djbH29FutyK4fzY_!!6000000008039-0-alipicbeacon.jpg_160x240.jpg">
</div>
<div class="movie-card-name">
<span class="bt-l">哥斯拉大战金刚2:帝国崛起</span>
<span class="bt-r"></span>
</div>
<div class="movie-card-info">
<div class="movie-card-mask"></div>
<div class="movie-card-list">
<span>导演:亚当·温加德</span>
<span>主演:哥斯拉,金刚,丽贝卡·豪尔,布莱恩·泰里·亨利,丹·史蒂文斯,凯莉·霍特尔,艾利克斯·费恩,陈法拉,瑞切尔·豪斯</span>
<span>类型:动作,冒险,科幻</span>
<span>地区:美国</span>
<span>语言:英语</span>
<span>片长:114分钟</span> </div>
</div>
</a>
<a href="https://dianying.taobao.com/showDetail.htm?showId=1429194&n_s=new" class="movie-card-buy">选座购票</a>
</div>
<div class="movie-card-wrap">
<a href="https://dianying.taobao.com/showDetail.htm?showId=1510621&n_s=new&source=current" class="movie-card">
<div class="movie-card-tag"><i class="t-"></i></div>
<div class="movie-card-poster">
<img width="160" height="224" data-src="https://img.alicdn.com/bao/uploaded/i2/O1CN01oBhfpu25YSToJdUnp_!!6000000007538-0-alipicbeacon.jpg_160x240.jpg" src="https://img.alicdn.com/bao/uploaded/i2/O1CN01oBhfpu25YSToJdUnp_!!6000000007538-0-alipicbeacon.jpg_160x240.jpg">
</div>
<div class="movie-card-name">
<span class="bt-l">灿烂的她</span>
<span class="bt-r">9.3</span>
</div>
<div class="movie-card-info">
<div class="movie-card-mask"></div>
<div class="movie-card-list">
<span>导演:徐伟</span>
<span>主演:惠英红,刘浩存,张子贤,刘欢,苇青,刘奕铁,胡宝森,廖银玥</span>
<span>类型:剧情,家庭</span>
<span>地区:中国大陆</span>
<span>语言:汉语普通话</span>
<span>片长:116分钟</span> </div>
</div>
</a>
<a href="https://dianying.taobao.com/showDetail.htm?showId=1510621&n_s=new" class="movie-card-buy">选座购票</a>
</div>
</div>
<!-- 即将热映 -->
<div class="tab-movie-list">
<div class="movie-card-wrap">
<a href="https://dianying.taobao.com/showDetail.htm?showId=1509787&n_s=new&source=soon" class="movie-card">
<div class="movie-card-tag"><i class="t-"></i></div>
<div class="movie-card-poster">
<img width="160" height="224" src="https://img.alicdn.com/bao/uploaded/i1/O1CN01B3BY1v1XZeZtnq0gf_!!6000000002938-0-alipicbeacon.jpg_160x240.jpg">
</div>
<div class="movie-card-name">
<span class="bt-l">国鼎魂(戏曲 苏剧)</span>
<span class="bt-r"></span>
</div>
<div class="movie-card-info">
<div class="movie-card-mask"></div>
<div class="movie-card-list">
<span>导演:汪灏, 蓝天</span>
<span>主演:王芳,张唐兵</span>
<span>类型:剧情,戏曲</span>
<span>地区:中国大陆</span>
<span>语言:吴语</span>
<span>片长:94</span>
</div>
</div>
</a>
<a href="https://dianying.taobao.com/showDetail.htm?showId=1509787&n_s=new&source=soon" class="movie-card-soon">上映时间2024-03-25</a>
</div>
<div class="movie-card-wrap">
<a href="https://dianying.taobao.com/showDetail.htm?showId=1429194&n_s=new&source=soon" class="movie-card">
<div class="movie-card-tag"><i class="t-"></i></div>
<div class="movie-card-poster">
<img width="160" height="224" src="https://img.alicdn.com/bao/uploaded/i3/O1CN01s4djbH29FutyK4fzY_!!6000000008039-0-alipicbeacon.jpg_160x240.jpg">
</div>
<div class="movie-card-name">
<span class="bt-l">哥斯拉大战金刚2:帝国崛起</span>
<span class="bt-r"></span>
</div>
<div class="movie-card-info">
<div class="movie-card-mask"></div>
<div class="movie-card-list">
<span>导演:亚当·温加德</span>
<span>主演:哥斯拉,金刚,丽贝卡·豪尔,布莱恩·泰里·亨利,丹·史蒂文斯,凯莉·霍特尔,艾利克斯·费恩,陈法拉,瑞切尔·豪斯</span>
<span>类型:动作,冒险,科幻</span>
<span>地区:美国</span>
<span>语言:英语</span>
<span>片长:114</span>
</div>
</div>
</a>
<a href="https://dianying.taobao.com/showDetail.htm?showId=1429194&n_s=new&source=soon" class="movie-card-soon">上映时间2024-03-29 09:00</a>
</div>
<div class="movie-card-wrap">
<a href="https://dianying.taobao.com/showDetail.htm?showId=1460919&n_s=new&source=soon" class="movie-card">
<div class="movie-card-tag"><i class="t-"></i></div>
<div class="movie-card-poster">
<img width="160" height="224" src="https://img.alicdn.com/bao/uploaded/i4/O1CN01QG9H8e1l1DPmBdsdG_!!6000000004758-0-alipicbeacon.jpg_160x240.jpg">
</div>
<div class="movie-card-name">
<span class="bt-l">坠落的审判</span>
<span class="bt-r"></span>
</div>
<div class="movie-card-info">
<div class="movie-card-mask"></div>
<div class="movie-card-list">
<span>导演:茹斯汀·特里耶</span>
<span>主演:桑德拉·惠勒,斯万·阿劳德,米洛·马查多·格拉内尔,安托万·赖纳茨,萨穆埃尔·泰斯,梅西,珍妮·贝丝</span>
<span>类型:剧情,家庭</span>
<span>地区:法国</span>
<span>语言:法语</span>
<span>片长:152</span>
</div>
</div>
</a>
<a href="https://dianying.taobao.com/showDetail.htm?showId=1460919&n_s=new&source=soon" class="movie-card-soon">上映时间2024-03-29 18:00</a>
</div>
<div class="movie-card-wrap">
<a href="https://dianying.taobao.com/showDetail.htm?showId=1458555&n_s=new&source=soon" class="movie-card">
<div class="movie-card-tag"><i class="t-"></i></div>
<div class="movie-card-poster">
<img width="160" height="224" src="https://img.alicdn.com/bao/uploaded/i3/O1CN01DX7BQ81es9tOUrUgU_!!6000000003926-2-alipicbeacon.png_160x240.jpg">
</div>
<div class="movie-card-name">
<span class="bt-l">银河写手</span>
<span class="bt-r"></span>
</div>
<div class="movie-card-info">
<div class="movie-card-mask"></div>
<div class="movie-card-list">
<span>导演:李阔, 单丹丹</span>
<span>主演:宋木子,合文俊,李飞,李文茹,宋晓亮,张皓森,刘默然,祁又一</span>
<span>类型:喜剧,剧情</span>
<span>地区:中国大陆</span>
<span>语言:汉语普通话</span>
<span>片长:103</span>
</div>
</div>
</a>
<a href="https://dianying.taobao.com/showDetail.htm?showId=1458555&n_s=new&source=soon" class="movie-card-soon">上映时间2024-03-30 09:00</a>
</div>
<div class="movie-card-wrap">
<a href="https://dianying.taobao.com/showDetail.htm?showId=1444866&n_s=new&source=soon" class="movie-card">
<div class="movie-card-tag"><i class="t-"></i></div>
<div class="movie-card-poster">
<img width="160" height="224" src="https://img.alicdn.com/bao/uploaded/i1/O1CN01aecCQf1MK7MyiR4mP_!!6000000001415-0-alipicbeacon.jpg_160x240.jpg">
</div>
<div class="movie-card-name">
<span class="bt-l">我们一起摇太阳</span>
<span class="bt-r"></span>
</div>
<div class="movie-card-info">
<div class="movie-card-mask"></div>
<div class="movie-card-list">
<span>导演:韩延</span>
<span>主演:彭昱畅,李庚希</span>
<span>类型:爱情,剧情,家庭</span>
<span>地区:中国大陆</span>
<span>语言:汉语普通话</span>
<span>片长:129</span>
</div>
</div>
</a>
<a href="https://dianying.taobao.com/showDetail.htm?showId=1444866&n_s=new&source=soon" class="movie-card-soon">上映时间2024-03-30 10:00</a>
</div>
</div>
</div>
</div>
Write the scraper based on this page structure:
from lxml import etree
import requests

url = "https://dianying.taobao.com/showList.htm?n_s=new"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
}

# Download the page and parse it into an element tree.
response = requests.get(url, headers=headers)
html = response.content.decode('utf-8')
tree = etree.HTML(html)

# Movie cards inside the first "tab-movie-list" block (now showing).
movies = tree.xpath('//div[@class="tab-movie-list"][1]//div[@class="movie-card-wrap"]')
if not movies:
    print("No movies found.")
else:
    for movie in movies:
        name = movie.xpath('.//span[@class="bt-l"]/text()')[0]
        try:
            score = movie.xpath('.//span[@class="bt-r"]/text()')[0]
        except IndexError:
            # The rating span is empty, so text() returned no nodes.
            score = "N/A"
        print(f"Movie: {name}, rating: {score}")
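Note: a rating of "N/A" means the score was not provided or is unavailable. Some movies in the list have an empty rating element (**bt-r**), perhaps because nobody has rated the film on the site yet or because the rating has not been updated. For those cards the XPath query returns an empty list, indexing it raises IndexError, and the script falls back to "N/A".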
Results:
Comparing the script's output with the page shows no discrepancies.
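The fallback for missing ratings can also be factored into a small helper, so that a missing title does not crash the loop either. This is only a sketch: `first_text` is a hypothetical helper name, and the snippet below is a trimmed-down stand-in for the page markup (same class names, placeholder titles), so it runs offline.

```python
from lxml import etree

# Trimmed-down stand-in for the page structure; titles are placeholders.
SNIPPET = """
<div class="tab-movie-list">
  <div class="movie-card-wrap">
    <div class="movie-card-name">
      <span class="bt-l">Movie A</span>
      <span class="bt-r">9.1</span>
    </div>
  </div>
  <div class="movie-card-wrap">
    <div class="movie-card-name">
      <span class="bt-l">Movie B</span>
      <span class="bt-r"></span>
    </div>
  </div>
</div>
"""


def first_text(node, expr, default="N/A"):
    """Return the first non-empty text result of expr, or default (hypothetical helper)."""
    results = node.xpath(expr)
    text = results[0].strip() if results else ""
    return text or default


tree = etree.HTML(SNIPPET)
rows = []
for card in tree.xpath('//div[@class="movie-card-wrap"]'):
    name = first_text(card, './/span[@class="bt-l"]/text()')
    score = first_text(card, './/span[@class="bt-r"]/text()')
    rows.append((name, score))

print(rows)  # [('Movie A', '9.1'), ('Movie B', 'N/A')]
```

Compared with a try/except around each lookup, the helper keeps the loop body short and treats whitespace-only text the same as a missing node.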
2. As a second example, take a typical news site and extract each story's title, summary, and link. Assume the target page has the following HTML structure:
<div class="news-list">
  <div class="news-item">
    <h2 class="news-title"><a href="news1.html">News title 1</a></h2>
    <p class="news-summary">News summary 1</p>
  </div>
  <div class="news-item">
    <h2 class="news-title"><a href="news2.html">News title 2</a></h2>
    <p class="news-summary">News summary 2</p>
  </div>
  <div class="news-item">
    <h2 class="news-title"><a href="news3.html">News title 3</a></h2>
    <p class="news-summary">News summary 3</p>
  </div>
  <!-- more news items -->
</div>
We can extract this information with the following Python code:
import requests
from lxml import html

url = 'https://example.com/news'
response = requests.get(url)
tree = html.fromstring(response.content)

# Select the news items with XPath
news_items = tree.xpath("//div[@class='news-list']/div[@class='news-item']")
for item in news_items:
    # Extract the news title
    title = item.xpath(".//h2[@class='news-title']/a/text()")[0]
    # Extract the news summary
    summary = item.xpath(".//p[@class='news-summary']/text()")[0]
    # Extract the news link
    link = item.xpath(".//h2[@class='news-title']/a/@href")[0]
    # Print the news record
    print(f"Title: {title}\nSummary: {summary}\nLink: {link}\n")
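The links in this structure are relative (for example news1.html), so a scraper usually needs to resolve them against the page URL. The sketch below does that with the standard library's `urljoin` and parses an inline copy of the markup, so it runs without a network request. `parse_news`, `PAGE`, and `BASE_URL` are names introduced here for illustration, and the base URL is the same placeholder address used above.

```python
from urllib.parse import urljoin

from lxml import html

BASE_URL = "https://example.com/news"  # placeholder address, as in the example above

# Inline copy of the assumed structure (two items only).
PAGE = """<html><body>
<div class="news-list">
  <div class="news-item">
    <h2 class="news-title"><a href="news1.html">News title 1</a></h2>
    <p class="news-summary">News summary 1</p>
  </div>
  <div class="news-item">
    <h2 class="news-title"><a href="news2.html">News title 2</a></h2>
    <p class="news-summary">News summary 2</p>
  </div>
</div>
</body></html>"""


def parse_news(page_source, base_url):
    """Return a list of dicts with title, summary, and absolute link."""
    tree = html.fromstring(page_source)
    records = []
    for item in tree.xpath("//div[@class='news-list']/div[@class='news-item']"):
        # string(...) yields '' instead of raising when a node is missing.
        title = item.xpath("string(.//h2[@class='news-title']/a)").strip()
        summary = item.xpath("string(.//p[@class='news-summary'])").strip()
        href = item.xpath("string(.//h2[@class='news-title']/a/@href)")
        records.append(
            {"title": title, "summary": summary, "link": urljoin(base_url, href)}
        )
    return records


records = parse_news(PAGE, BASE_URL)
for record in records:
    print(record)
```

Using `string(...)` rather than indexing with `[0]` is a design choice for this sketch: a missing element produces an empty string, so one malformed item does not stop the whole loop.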
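Summary
XPath is a powerful tool for Python scrapers: it locates and extracts data from web pages efficiently. Once you know its basic syntax and common patterns, writing an effective scraper becomes much easier. When working with XPath, it helps to find and verify your expressions in the browser's developer tools first, which speeds up development.
To scrape data successfully you need to be familiar with the page's front-end structure. Inspect and analyze that structure before you start, so that you work faster and extract exactly what you want.
Interesting, isn't it? If you'd like to learn more, come and study along with me.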