Python Web Scraping, Lesson 13: HTML and CSS Selectors

news2025/4/12 9:29:00

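Parsing pages and extracting data is the core of web scraping, and HTML and CSS selectors are the standard way to locate specific elements in a page. This lesson explains how to use them in Python with libraries such as Beautiful Soup and Scrapy.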

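HTML Selectors

HTML selectors locate elements by tag name and attribute values. In Beautiful Soup they correspond to the find() and find_all() methods. The common types are:

Tag selector:
- soup.find_all('div') returns every <div> tag on the page.
Class selector:
- soup.find_all(class_='classname') finds every element whose class attribute includes classname (an element with several classes still matches).
ID selector:
- soup.find(id='uniqueid') returns the element whose ID is uniqueid, or None if there is no such element.
Attribute selector:
- soup.find_all(attrs={'data-type': 'value'}) finds every element that has the given attribute with the given value.
Combined selector:
- soup.select('div p') uses CSS selector syntax to find every <p> tag nested inside a <div>.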
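The lookups listed above can be tried against a small inline document. This is a minimal sketch: the HTML snippet and its class, ID, and data-type values are made up purely for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, invented only to exercise each lookup.
html = """
<div id="uniqueid" class="classname extra">
  <p data-type="value">first</p>
</div>
<div>
  <p>second</p>
</div>
<p class="classname">outside</p>
"""

soup = BeautifulSoup(html, "html.parser")

divs = soup.find_all("div")                             # tag lookup
by_class = soup.find_all(class_="classname")            # class lookup
by_id = soup.find(id="uniqueid")                        # ID lookup: first match or None
by_attr = soup.find_all(attrs={"data-type": "value"})   # attribute lookup
nested = soup.select("div p")                           # CSS descendant selector

print(len(divs), len(by_class), by_id.name, len(by_attr), len(nested))
```

The class lookup matches the first <div> even though it carries two classes, and the <p> outside any <div> is excluded from the 'div p' result.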

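CSS Selectors

CSS selectors are more expressive: they can select elements by relationship, position, and state. Some examples:

Child selector:
- #parent > .child selects only elements with class child that are direct children of #parent.
Descendant selector:
- .container .item selects every .item element inside a .container element, at any nesting depth.
Adjacent sibling selector:
- h1 + p selects the p element that immediately follows an h1 element.
General sibling selector:
- h1 ~ p selects every p element that follows an h1 element and shares its parent.
Pseudo-class selector:
- a:hover selects links in the hover state. State pseudo-classes like this depend on user interaction in a browser, so they are of little use on static, parsed HTML; structural pseudo-classes such as :first-child or :nth-of-type(2) are the ones that help in scraping.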
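The difference between these combinators is easiest to see side by side. The sketch below runs each one through Beautiful Soup's select() on a small hypothetical document; the element names and text are assumptions chosen so that each selector returns a different result.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one direct child, one nested child, one later sibling.
html = """
<div id="parent" class="container">
  <h1>Heading</h1>
  <p class="child item">direct child</p>
  <section>
    <p class="child item">nested</p>
  </section>
  <p>later sibling</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")


def texts(selector):
    """Return the stripped text of every element matching the selector."""
    return [el.get_text(strip=True) for el in soup.select(selector)]


child = texts("#parent > .child")             # direct children only
descendant = texts(".container .item")        # any nesting depth
adjacent = texts("h1 + p")                    # the p right after h1
general = texts("h1 ~ p")                     # every later sibling p of h1
second_p = texts("#parent > p:nth-of-type(2)")  # structural pseudo-class

for name, result in [("child", child), ("descendant", descendant),
                     ("adjacent", adjacent), ("general", general),
                     ("second_p", second_p)]:
    print(name, result)
```

Note that the nested paragraph appears only in the descendant result: it is neither a direct child of #parent nor a sibling of the h1.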

In Python, Beautiful Soup evaluates CSS selectors through its .select() method, and Scrapy offers the .css() method for the same purpose.

Example Code

Using Beautiful Soup:

from bs4 import BeautifulSoup
import requests

# A timeout keeps the request from hanging indefinitely
response = requests.get('http://example.com', timeout=10)
soup = BeautifulSoup(response.text, 'html.parser')

# Tag selector
divs = soup.find_all('div')

# Class selector
classname_elements = soup.find_all(class_='classname')

# CSS selector
container_items = soup.select('.container .item')

Using Scrapy:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://example.com']

    def parse(self, response):
        divs = response.css('div')
        classname_elements = response.css('.classname')
        container_items = response.css('.container .item')

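Those are the basic ways to use HTML and CSS selectors to extract data in a Python scraper. In practice, several selectors can be combined to pinpoint exactly the information needed.

The examples below look more closely at using Beautiful Soup and Scrapy with HTML and CSS selectors to extract page data.

Beautiful Soup Example

Suppose we want to collect every article title and author from a site. We could do it like this: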

from bs4 import BeautifulSoup
import requests

def fetch_data(url):
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Assumes titles are in <h2> tags with class "title"
    titles = [tag.text for tag in soup.find_all('h2', class_='title')]

    # Assumes authors are in <span> tags with class "author"
    authors = [tag.text for tag in soup.find_all('span', class_='author')]
    
    return titles, authors

url = 'http://example.com/articles'
titles, authors = fetch_data(url)

print("Titles:", titles)
print("Authors:", authors)
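Collecting titles and authors into two separate lists assumes every article has both; if one author is missing, the lists no longer line up. One way around this, sketched below, is to select each article's container first and then query inside it. The <article> wrapper and the sample HTML are assumptions about the page structure, and parse_articles is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment; the second article has no author.
html = """
<article><h2 class="title">First post</h2><span class="author">Ann</span></article>
<article><h2 class="title">Second post</h2></article>
<article><h2 class="title">Third post</h2><span class="author">Bo</span></article>
"""


def parse_articles(html_text):
    """Return one dict per <article>, keeping each title with its own author."""
    soup = BeautifulSoup(html_text, "html.parser")
    rows = []
    for article in soup.select("article"):
        # select_one() returns the first match inside this article, or None.
        title = article.select_one("h2.title")
        author = article.select_one("span.author")
        rows.append({
            "title": title.get_text(strip=True) if title else None,
            "author": author.get_text(strip=True) if author else None,
        })
    return rows


articles = parse_articles(html)
for row in articles:
    print(row)
```

Because the function takes an HTML string instead of a URL, it can be checked against saved pages without making a request.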

Scrapy Example

Scrapy is a more full-featured framework: it issues requests asynchronously and supports more complex crawling logic. Here is a simple Scrapy spider:

import scrapy

class ArticleSpider(scrapy.Spider):
    name = 'article_spider'
    start_urls = ['http://example.com/articles']

    def parse(self, response):
        # Use a CSS selector to get the article titles
        titles = response.css('h2.title::text').getall()

        # Get the author names
        authors = response.css('span.author::text').getall()

        # Pair each title with its author
        for title, author in zip(titles, authors):
            yield {
                'title': title,
                'author': author
            }

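The simplest way to run this spider is to save the code in a file (for example, article_spider.py) and run scrapy runspider article_spider.py. In a full project created with scrapy startproject, put the file in the project's spiders directory and run scrapy crawl article_spider, where article_spider is the spider's name attribute.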

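Handling Pagination and More Complex Data Structures

If the site is paginated, a callback can process each page in turn. For example:

import scrapy

class ArticleSpider(scrapy.Spider):
    name = 'article_spider'
    start_urls = ['http://example.com/articles/page/1']

    def parse(self, response):
        # ... code that extracts titles and authors ...

        # Get the link to the next page
        next_page = response.css('a.next-page::attr(href)').get()

        if next_page is not None:
            yield response.follow(next_page, self.parse)

In this example, response.follow builds a new request for the next_page URL (relative links are resolved against the current page), and Scrapy calls self.parse on the response once the request has been downloaded.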
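response.follow accepts relative links because they are resolved against the URL of the current page. The standard-library sketch below shows that resolution step on its own, using urljoin as a stand-in; the href values are hypothetical examples of what an a.next-page link might contain.

```python
from urllib.parse import urljoin

current_page = "http://example.com/articles/page/1"

# A bare relative href replaces the last path segment of the current URL.
relative = urljoin(current_page, "2")

# A root-relative href replaces the whole path.
rooted = urljoin(current_page, "/articles/page/2")

# An absolute href is used unchanged.
absolute = urljoin(current_page, "http://example.com/articles/page/3")

print(relative)
print(rooted)
print(absolute)
```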

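Those are the basic examples of extracting data with Beautiful Soup and Scrapy using HTML and CSS selectors. In real projects, the selectors usually need adjusting to the target site's actual HTML structure.

Building on the code above, we can add exception handling, logging, and better data storage. The examples below show how to do this in Beautiful Soup and Scrapy:

Adding Exception Handling and Logging with Beautiful Soup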

import logging
import requests
from bs4 import BeautifulSoup

logging.basicConfig(level=logging.INFO)

def fetch_data(url):
    try:
        response = requests.get(url, timeout=10)  # Avoid hanging indefinitely
        response.raise_for_status()  # Raise HTTPError on a 4xx or 5xx status
        soup = BeautifulSoup(response.text, 'html.parser')
        
        titles = [tag.text for tag in soup.find_all('h2', class_='title')]
        authors = [tag.text for tag in soup.find_all('span', class_='author')]
        
        logging.info(f"Fetched {len(titles)} articles from {url}")
        return titles, authors
    except requests.exceptions.RequestException as e:
        logging.error(f"Error fetching data from {url}: {e}")
        return [], []

url = 'http://example.com/articles'
titles, authors = fetch_data(url)

if titles and authors:
    print("Titles:", titles)
    print("Authors:", authors)
else:
    logging.warning("No data fetched.")
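Catching the exception handles a failed request, but transient failures are often worth retrying first. A possible extension, sketched below, is a requests Session whose adapter retries automatically. The retry count, backoff factor, and status codes are illustrative assumptions, not recommended values, and build_session is a hypothetical helper name.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def build_session(total_retries=3):
    """Return a Session that retries failed requests with backoff."""
    retry = Retry(
        total=total_retries,                      # assumed retry budget
        backoff_factor=0.5,                       # growing delay between attempts
        status_forcelist=(500, 502, 503, 504),    # also retry on these statuses
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    # Use the retrying adapter for both URL schemes.
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session


session = build_session()
# Inspect the configuration; no request is sent here.
configured = session.get_adapter("http://example.com/articles").max_retries
print(configured.total)
```

In fetch_data, session.get(url, timeout=10) would then take the place of requests.get(url), with the same try/except around it.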

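Saving Scrapy Data to a CSV File

In Scrapy, the built-in feed exports can save scraped items in several formats, such as JSON or CSV. To set up a Scrapy project to save data to a CSV file:

In the project's settings.py file, add the following configuration (the FEEDS setting requires Scrapy 2.1 or later):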

FEEDS = {
    'articles.csv': {'format': 'csv'},
}

This tells Scrapy to write its output to a CSV file named articles.csv.
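Each feed entry accepts further options beyond the format. The fragment below is a sketch of a slightly fuller settings.py entry; the option values are assumptions for this example, and the overwrite key is only available in newer Scrapy releases (2.4 and later, as far as I know).

```python
# settings.py (fragment)
FEEDS = {
    "articles.csv": {
        "format": "csv",
        "encoding": "utf8",
        # Restrict and order the CSV columns to match the yielded dict keys.
        "fields": ["title", "author"],
        # Replace the file on each run when True.
        "overwrite": True,
    },
}
```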

Make sure the spider builds a dictionary for each item and yields it:

import scrapy

class ArticleSpider(scrapy.Spider):
    name = 'article_spider'
    start_urls = ['http://example.com/articles']

    def parse(self, response):
        titles = response.css('h2.title::text').getall()
        authors = response.css('span.author::text').getall()
        
        for title, author in zip(titles, authors):
            yield {
                'title': title,
                'author': author
            }

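When the spider runs, Scrapy writes the data to articles.csv automatically.

Handling More Complex Data Structures

If the data structure is complex, several parsing passes (or recursion) may be needed. For example, if each article's details are on a separate page, Scrapy's response.follow method can request and parse the detail pages:

import scrapy

class ArticleSpider(scrapy.Spider):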
    name = 'article_spider'
    start_urls = ['http://example.com/articles']

    def parse(self, response):
        article_links = response.css('a.article-link::attr(href)').getall()
        
        for link in article_links:
            yield response.follow(link, self.parse_article)

    def parse_article(self, response):
        title = response.css('h1.article-title::text').get()
        content = response.css('div.article-content *::text').getall()
        
        yield {
            'title': title,
            'content': content
        }
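In parse_article, getall() returns a list of text fragments, many of which are just whitespace between tags. If a single string is more convenient to store, the fragments can be cleaned and joined. The helper below is a hypothetical sketch, and the sample list only imitates what getall() might return.

```python
def join_text(fragments):
    """Strip each fragment, drop the empty ones, and join with spaces."""
    parts = (fragment.strip() for fragment in fragments)
    return " ".join(part for part in parts if part)


# Hypothetical output of response.css('div.article-content *::text').getall()
content = ["\n  ", "First paragraph.", "\n", " Second ", "paragraph.", "\n"]
text = join_text(content)
print(text)
```

Inside the spider, 'content': join_text(content) would then yield one string per article in place of a list.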