Scrapy and Distributed Development (2.3): A Detailed Guide to Basic lxml + XPath Expressions and Extraction Methods

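1. Introduction to XPath

XPath, short for XML Path Language, is a language for locating information in XML documents. It lets you navigate a document using path expressions. XPath is not limited to XML; it is also commonly used on HTML documents.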

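2. Basic Expressions and Extraction Methods

Selecting nodes

With XPath you can select nodes in an XML document.
* Select the document root: /
* Select child nodes: /parent/child
* Select all element nodes: //*
* Select descendants: //descendant selects every element named descendant anywhere in the document, at any depth. Here descendant is a placeholder element name, not the descendant:: axis.
* Select following siblings: /sibling1/following-sibling::sibling2 selects every sibling2 element that comes after sibling1 under the same parent. Append [1] to keep only the nearest one.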
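
The selection expressions above can be tried out with a short sketch. The sample document and its element names (parent, child, other) are made up for this illustration and are not from the original article.

```python
from lxml import etree

# Hypothetical sample document, used only for this illustration.
xml = b"""<parent>
  <child id="a">one</child>
  <child id="b">two</child>
  <other>three</other>
</parent>"""
root = etree.fromstring(xml)

# /parent/child: child elements reached through an absolute path.
child_ids = [c.get('id') for c in root.xpath('/parent/child')]

# //*: every element in the document, including the root element.
all_tags = [e.tag for e in root.xpath('//*')]

# //other: elements named "other" at any depth.
others = [e.text for e in root.xpath('//other')]

# following-sibling:: returns every later sibling, not only the adjacent one.
later = [e.tag for e in root.xpath('/parent/child[1]/following-sibling::*')]

# Adding [1] keeps only the nearest following sibling.
nearest = [e.get('id') for e in root.xpath('/parent/child[1]/following-sibling::*[1]')]

print(child_ids, all_tags, others, later, nearest)
```

Calling xpath() on an element with an absolute path still evaluates from the document root, which is why root.xpath('/parent/child') works here.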

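Using axes

XPath provides a number of axes that let you select nodes based on their relationship to one another.
* Child axis: /parent/child (short for /child::parent/child::child)
* Following-sibling axis: /parent/child1/following-sibling::child2
* Attribute axis: /parent/child/@attribute (short for /parent/child/attribute::attribute)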
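
As a rough illustration of how the abbreviated and full axis forms relate, the sketch below runs both against a small made-up document. The element and attribute names are the placeholders from the list above, not real data.

```python
from lxml import etree

# Hypothetical document reusing the placeholder names from the list above.
root = etree.fromstring(
    b'<parent><child1 attribute="x"/><child2/><child2/></parent>'
)

# Child axis: the abbreviated and the fully written forms select the same nodes.
short_form = [e.tag for e in root.xpath('/parent/child1')]
full_form = [e.tag for e in root.xpath('/child::parent/child::child1')]

# Following-sibling axis: all child2 elements that come after child1.
siblings = root.xpath('/parent/child1/following-sibling::child2')

# Attribute axis: @attribute is shorthand for attribute::attribute.
attr_short = root.xpath('/parent/child1/@attribute')
attr_full = root.xpath('/parent/child1/attribute::attribute')

# Parent axis: step back up from a child to its parent.
parents = [e.tag for e in root.xpath('//child2/parent::*')]

print(short_form, full_form, len(siblings), attr_short, attr_full, parents)
```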

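Using predicates

Predicates filter a node set so you can pinpoint nodes more precisely.
* Select the first node: /parent/child[1]
* Select nodes with a specific attribute value: /parent/child[@attribute='value']
* Select several nodes that satisfy a condition: /parent/child[position() > 1]
* Use /parent/child/@attribute to select attribute nodes directly.
* Select by position among the matching children of a parent: [1] is the first one and [last()] is the last one, as in /parent/child[last()]. Positions start at 1, not 0.
* Use /parent/child[text()='value'] to select nodes whose text content equals a given value.
* Combine conditions with and and or, for example /parent/child[@attribute1='value1' and @attribute2='value2'].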
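
The predicates listed above can be checked with a minimal sketch like the following. The document and the attribute names kind and lang are assumptions chosen for the example.

```python
from lxml import etree

# Hypothetical document; "kind" and "lang" are made-up attribute names.
root = etree.fromstring(
    b'<parent>'
    b'<child kind="a" lang="en">x</child>'
    b'<child kind="b" lang="en">value</child>'
    b'<child kind="a" lang="fr">z</child>'
    b'</parent>'
)

first = root.xpath('/parent/child[1]/text()')             # first child
last = root.xpath('/parent/child[last()]/text()')         # last child
rest = root.xpath('/parent/child[position() > 1]/text()') # all but the first
by_attr = root.xpath("/parent/child[@kind='a']/text()")   # attribute equals a value
by_text = root.xpath("/parent/child[text()='value']/@kind")  # text equals a value
both = root.xpath("/parent/child[@kind='a' and @lang='en']/text()")
either = root.xpath("/parent/child[@kind='b' or @lang='fr']/text()")

print(first, last, rest, by_attr, by_text, both, either)
```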

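Extracting text

XPath can extract a node's text content as well as locate the node.
* Use text() to select a node's text nodes, for example /parent/child/text().
* Use string() to get a node's string value, which concatenates the text of the node and all its descendants. This suits nodes with nested markup.
* Use /@attribute to extract an attribute value, for example /parent/child/@attribute.
* Use the union operator | to combine several XPath expressions and extract multiple nodes or attributes in one query, for example /parent/child1 | /parent/child2 | /parent/@attribute. The comma-separated form /parent/(child1, child2, @attribute) is XPath 2.0 syntax, which lxml does not support because it implements XPath 1.0.
* Use . to refer to the current node. The node() test is different: it matches child nodes of any type (elements, text, comments), for example /parent/child/node().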
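
To make the difference between text(), string(), the union operator and node() concrete, here is a small sketch. The document is invented for the example; only the lxml calls themselves are real API.

```python
from lxml import etree

# Hypothetical document with nested markup inside child1.
root = etree.fromstring(
    b'<parent attribute="p">'
    b'<child1 attribute="c">Hello <b>bold</b> world</child1>'
    b'<child2>second</child2>'
    b'</parent>'
)

# text() returns only the text nodes directly under child1.
text_nodes = root.xpath('/parent/child1/text()')

# string() concatenates the text of child1 and all its descendants.
full_text = root.xpath('string(/parent/child1)')

# @attribute returns the attribute value.
attr = root.xpath('/parent/child1/@attribute')

# The union operator | merges the results of several expressions (XPath 1.0).
combined = root.xpath('/parent/child1/@attribute | /parent/child2/text()')

# "." is the current node; node() matches child nodes of any type.
child1 = root.xpath('/parent/child1')[0]
current = child1.xpath('.')
children = child1.xpath('node()')

print(text_nodes, full_text, attr, combined, len(children))
```

Note that text() skips the word inside the nested b element, while string() includes it.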

3. Worked Examples

The following XPath queries show how to extract data from an XML document.

Sample XML document:

<bookstore>
  <book>
    <title lang="en">Harry Potter</title>
    <author>J.K. Rowling</author>
    <price>29.99</price>
  </book>
  <book>
    <title lang="en">Learning XML</title>
    <author>Erik T. Ray</author>
    <price>39.95</price>
  </book>
  <book>
    <title lang="zh-CN">西游记</title>
    <author>吴承恩</author>
    <price>28.80</price>
  </book>
</bookstore>

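Select all book titles
XPath expression: /bookstore/book/title
Result: <title lang="en">Harry Potter</title>, <title lang="en">Learning XML</title>, <title lang="zh-CN">西游记</title>

Select the price of the second book
XPath expression: /bookstore/book[2]/price
Result: <price>39.95</price>

Select all English book titles
XPath expression: /bookstore/book/title[@lang='en']
Result: <title lang="en">Harry Potter</title>, <title lang="en">Learning XML</title>

Select all books priced above 30
XPath expression: /bookstore/book[price > 30]
Result: <book>...</book> (the book Learning XML, the only one priced above 30)

Select the author name of every book
XPath expression: /bookstore/book/author/text()
Result: J.K. Rowling, Erik T. Ray, 吴承恩

Select the title text of the first book
XPath expression: /bookstore/book[1]/title/text()
Result: Harry Potter

Select the price of every book (as text)
XPath expression: /bookstore/book/price/text()
Result: 29.99, 39.95, 28.80

Select all title nodes that have an attribute
XPath expression: //title[@*]
Result: every <title> node that carries at least one attribute, such as <title lang="en">Harry Potter</title>. In this document that is all three titles.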

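Extract several kinds of node and return their text
XPath expression: /bookstore/book/title/text() | /bookstore/book/author/text()
Result: the title and author text of every book, which lxml returns in document order: Harry Potter, J.K. Rowling, Learning XML, Erik T. Ray, 西游记, 吴承恩. The form /bookstore/book/(title/text(), author/text()) requires XPath 2.0 and is rejected by lxml, which implements XPath 1.0.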

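Extract direct child nodes
XPath expression: /bookstore/book/price
Result: all <price> nodes, because <price> is a direct child of <book>.

Extract all child elements of a node
XPath expression: /bookstore/book/*
Result: all child elements of every book, namely <title>, <author> and <price>.

Extract a node's attribute
XPath expression: /bookstore/book/title/@lang
Result: the lang attribute value of every <title> node, here "en", "en" and "zh-CN".

Extract a node's parent
XPath expression: /bookstore/book/price/parent::book
Result: the parent <book> node of each <price> node.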

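Extract a node's preceding or following siblings
XPath expressions: /bookstore/book[2]/preceding-sibling::book and /bookstore/book[2]/following-sibling::book
Result: the first expression returns the first book and the second returns the third book. The axis names are preceding-sibling and following-sibling; XPath has no previous-sibling or next-sibling axis. Sibling axes only reach nodes that share a parent, so the <title> elements of different books are not siblings of one another.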

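Extract a node's ancestor
XPath expression: /bookstore/book/title/ancestor::bookstore
Result: the <bookstore> ancestor of the <title> nodes.

Extract a node together with its descendants
XPath expression: /bookstore/book[1]
Result: the first <book> element. Its descendants come with it, so you get the complete information for the first book.

Extract the set of nodes that satisfy a condition
XPath expression: /bookstore/book[price > 30]
Result: all <book> nodes whose price is greater than 30.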
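
Several of the queries above can be checked against the sample document with a short script. This is a sketch that embeds the sample XML as a string; the variable names are chosen for the illustration.

```python
from lxml import etree

# The sample bookstore document from this section, embedded as a string.
bookstore_xml = """<bookstore>
  <book>
    <title lang="en">Harry Potter</title>
    <author>J.K. Rowling</author>
    <price>29.99</price>
  </book>
  <book>
    <title lang="en">Learning XML</title>
    <author>Erik T. Ray</author>
    <price>39.95</price>
  </book>
  <book>
    <title lang="zh-CN">西游记</title>
    <author>吴承恩</author>
    <price>28.80</price>
  </book>
</bookstore>"""
root = etree.fromstring(bookstore_xml)

titles = root.xpath('/bookstore/book/title/text()')
second_price = root.xpath('/bookstore/book[2]/price/text()')
english = root.xpath("/bookstore/book/title[@lang='en']/text()")
expensive = root.xpath('/bookstore/book[price > 30]/title/text()')
langs = root.xpath('/bookstore/book/title/@lang')

# Sibling axes work on the book elements, which share the bookstore parent.
before = root.xpath('/bookstore/book[2]/preceding-sibling::book/title/text()')
after = root.xpath('/bookstore/book[2]/following-sibling::book/title/text()')

print(titles, second_price, english, expensive, langs, before, after)
```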

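4. Using XPath with lxml

In Python, lxml is a powerful library for parsing XML and HTML documents. Combined with XPath, it makes it straightforward to locate and extract specific information from a document. This section explains how to parse XML and extract data with lxml and XPath, focusing on practical expressions and text extraction methods.

Installing lxml

First make sure lxml is installed. If it is not, install it with pip: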

pip install lxml

Loading an XML document

Load the XML document with lxml's etree module:

from lxml import etree
# Load the XML document from a file
tree = etree.parse('example.xml')
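
When the XML is already in memory rather than in a file such as example.xml, a sketch along these lines may be more convenient. The inline document is a shortened stand-in for the sample bookstore.

```python
import io
from lxml import etree

# Shortened stand-in for the bookstore document, held in memory as bytes.
xml_bytes = b'<bookstore><book><title lang="en">Harry Potter</title></book></bookstore>'

# etree.fromstring parses a string or bytes and returns the root element.
root = etree.fromstring(xml_bytes)

# etree.parse accepts a path or a file-like object and returns an ElementTree.
tree = etree.parse(io.BytesIO(xml_bytes))

root_tag = root.tag
tree_root_tag = tree.getroot().tag

# Both the element and the tree expose xpath().
title_from_root = root.xpath('/bookstore/book/title/text()')
title_from_tree = tree.xpath('/bookstore/book/title/text()')

print(root_tag, tree_root_tag, title_from_root, title_from_tree)
```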

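Extracting data with XPath

Selecting nodes
Select all <book> nodes:

books = tree.xpath('/bookstore/book')

Selecting a specific node
Select the first <book> node (xpath() returns a list, here with one element):

first_book = tree.xpath('/bookstore/book[1]')

Selecting attributes
Select the lang attribute value of every <title> node:

titles = tree.xpath('/bookstore/book/title/@lang')

Selecting text content
Select the text content of every <title> node:

titles_text = tree.xpath('/bookstore/book/title/text()')

Selecting several kinds of node and their text
Select the <title> and <author> text of every <book> node. lxml implements XPath 1.0, so use the union operator | rather than the XPath 2.0 form /bookstore/book/(title/text(), author/text()):

books_info = tree.xpath('/bookstore/book/title/text() | /bookstore/book/author/text()')

Conditional selection
Select the <book> nodes whose price is greater than 30:

expensive_books = tree.xpath('/bookstore/book[price > 30]')

Selecting descendants
Select all <price> nodes at any depth:

prices = tree.xpath('//price')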
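
A union query returns one flat list, so the link between each title and its author is lost. One possible way to keep fields grouped per book is to loop over the <book> elements and run relative expressions on each; the sketch below assumes the sample bookstore document and uses variable names chosen for the illustration.

```python
from lxml import etree

# Compact copy of the sample bookstore document.
bookstore_xml = (
    '<bookstore>'
    '<book><title lang="en">Harry Potter</title>'
    '<author>J.K. Rowling</author><price>29.99</price></book>'
    '<book><title lang="en">Learning XML</title>'
    '<author>Erik T. Ray</author><price>39.95</price></book>'
    '<book><title lang="zh-CN">西游记</title>'
    '<author>吴承恩</author><price>28.80</price></book>'
    '</bookstore>'
)
root = etree.fromstring(bookstore_xml)

books_info = []
for book in root.xpath('/bookstore/book'):
    # Expressions without a leading slash are evaluated relative to this book.
    title = book.xpath('string(title)')
    author = book.xpath('string(author)')
    price = float(book.xpath('string(price)'))
    books_info.append((title, author, price))

for title, author, price in books_info:
    print(f'{title} by {author}: {price:.2f}')
```

A leading // inside the loop would search the whole document again, so relative paths such as title or ./title are the safer choice there.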

Hands-on examples

Case 1: Extracting blog article titles

from lxml import etree

# Assume html_content holds the HTML of a blog page
html_content = """
<html>
<head>
    <title>My Blog</title>
</head>
<body>
    <h1>Welcome to My Blog</h1>
    <div class="post">
        <h2>Article 1 Title</h2>
        <p>Article 1 content...</p>
    </div>
    <div class="post">
        <h2>Article 2 Title</h2>
        <p>Article 2 content...</p>
    </div>
</body>
</html>
"""

# Parse the HTML
tree = etree.HTML(html_content)

# Locate every <h2> element and extract its text content
article_titles = tree.xpath('//h2/text()')

# Print the article titles
for title in article_titles:
    print(title.strip())  # strip() removes any surrounding whitespace

Case 2: Extracting links and link text

from lxml import etree

html_content = """
<html>
<head>
    <title>Links Page</title>
</head>
<body>
    <p>Here are some links:</p>
    <ul>
        <li><a href="https://example.com/link1">Link 1</a></li>
        <li><a href="https://example.com/link2">Link 2</a></li>
        <li><a href="https://example.com/link3">Link 3</a></li>
    </ul>
</body>
</html>
"""

# Parse the HTML
tree = etree.HTML(html_content)

# Extract every link together with its text
links = tree.xpath('//a')
for link in links:
    # link.text is None when the tag has no leading text, so fall back to ''
    link_text = (link.text or '').strip()
    link_href = link.get('href')  # read the href attribute
    print(f"Link Text: {link_text}, Link: {link_href}")

Case 3: Extracting table data

from lxml import etree

html_content = """
<html>
<head>
    <title>Table Page</title>
</head>
<body>
    <table>
        <tr>
            <th>Header 1</th>
            <th>Header 2</th>
            <th>Header 3</th>
        </tr>
        <tr>
            <td>Row 1, Col 1</td>
            <td>Row 1, Col 2</td>
            <td>Row 1, Col 3</td>
        </tr>
        <tr>
            <td>Row 2, Col 1</td>
            <td>Row 2, Col 2</td>
            <td>Row 2, Col 3</td>
        </tr>
    </table>
</body>
</html>
"""

# Parse the HTML
tree = etree.HTML(html_content)

# Extract every row of the table
table_rows = tree.xpath('//table/tr')

# Walk the rows and extract the cell data
for row in table_rows:
    # Header cells are <th>, data cells are <td>; this assumes every row has the same number of columns
    cells = row.xpath('td|th')
    # cell.text is None for an empty cell, so fall back to ''
    row_data = [(cell.text or '').strip() for cell in cells]
    print(row_data)

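Notes

XPath expressions are case-sensitive; make sure tag names match the case used in the XML document.
If the XML document uses namespaces, you need to handle them in your XPath expressions, for example by passing a prefix mapping through the namespaces argument of xpath().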
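
Both notes can be seen in a small sketch. The Atom feed below is a made-up sample, and the prefix atom is an arbitrary name chosen here; only the namespace URI has to match the document.

```python
from lxml import etree

# Hypothetical document that declares a default namespace.
xml = (
    b'<feed xmlns="http://www.w3.org/2005/Atom">'
    b'<entry><title>Hello</title></entry>'
    b'</feed>'
)
root = etree.fromstring(xml)

# Without a prefix the name test looks for elements in no namespace: no match.
no_prefix = root.xpath('//title/text()')

# Map a prefix of your choice to the namespace URI and use it in the expression.
ns = {'atom': 'http://www.w3.org/2005/Atom'}
with_prefix = root.xpath('//atom:title/text()', namespaces=ns)

# Names are case-sensitive: Title does not match title.
wrong_case = root.xpath('//atom:Title', namespaces=ns)

print(no_prefix, with_prefix, wrong_case)
```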

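Practical tips

Getting an XPath quickly from the browser

Open the browser's developer tools, select the element you want to extract in the Elements panel, then right-click it and choose Copy > Copy XPath to get an XPath expression for that position.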

XPath Helper

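The XPath Helper browser extension shows what an expression matches on the live page, so you can check visually whether it selects what you intended.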

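The browser finds it but the code does not?

Sometimes an expression copied straight from the browser, or one you have confirmed works in the browser, still returns nothing when run in code. A common cause is that the HTML returned by the server has structure A while the DOM rendered by the browser has structure B, so an expression written against B finds nothing in A. This is especially common in tables, at the table > tbody > tr level: if the page source has no tbody, browsers generally insert one automatically.
Solution: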
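
One possible approach, sketched here under the assumption that the raw HTML really has no tbody, is to write the expression against the source the server sends rather than the rendered DOM. The table below is a made-up sample.

```python
from lxml import etree

# Hypothetical raw HTML as sent by the server: the table has no <tbody>.
html = (
    '<html><body><table>'
    '<tr><td>A1</td></tr>'
    '<tr><td>A2</td></tr>'
    '</table></body></html>'
)
tree = etree.HTML(html)

# An expression copied from the browser usually includes tbody and finds nothing.
browser_style = tree.xpath('//table/tbody/tr/td/text()')

# Dropping tbody matches the raw structure.
without_tbody = tree.xpath('//table/tr/td/text()')

# Using // between table and tr matches whether or not a tbody is present.
tolerant = tree.xpath('//table//tr/td/text()')

print(browser_style, without_tbody, tolerant)
```

Printing the downloaded HTML, or viewing the page source instead of the Elements panel, is a quick way to confirm which structure the code actually receives.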