Scrapy, a Python Web Crawling Framework: A Beginner's Introduction

2025/4/15 20:00:00

I. Install the Scrapy package

pip install Scrapy

II. Create a Scrapy project (tutorial)

scrapy startproject tutorial

The project directory contains the following:

tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py

III. Write a spider in the tutorial/spiders directory (quotes_spider.py)

1. Spider (version 1.0)

import scrapy
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)
    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f'quotes-{page}.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log(f'Saved file {filename}')

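The QuotesSpider class must subclass scrapy.Spider and define the following attributes and methods:

name: the spider's name (it must be unique within a project)

start_requests(): the spider's starting point (it must return an iterable of requests, either a list or a generator)

Request: scrapy.Request(url=url, callback=self.parse)

parse(): the callback that handles the response downloaded for each request (the response argument holds the page content)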
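To see how parse() builds the file name, the URL handling can be tried on its own. The helper name filename_for is hypothetical and only wraps the two lines from the spider above; it is a sketch, not part of Scrapy.

```python
def filename_for(url: str) -> str:
    """Build the output file name the same way parse() does."""
    # Splitting 'http://quotes.toscrape.com/page/1/' on '/' ends with
    # [..., 'page', '1', ''], so index -2 is the page number.
    page = url.split("/")[-2]
    return f"quotes-{page}.html"


print(filename_for("http://quotes.toscrape.com/page/1/"))
print(filename_for("http://quotes.toscrape.com/page/2/"))
```

Note that this relies on the URL ending with a slash; without it, index -2 would pick the "page" segment instead.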

2. Spider (version 2.0)

import scrapy
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]
    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f'quotes-{page}.html'
        with open(filename, 'wb') as f:
            f.write(response.body)

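Instead of overriding start_requests(), you can simply define the start_urls attribute; the default start_requests() implementation creates the initial requests from start_urls and uses parse() as their callback.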

3. Spider (version 3.0)

Page HTML:

<div class="quote">
    <span class="text">“The world as we have created it is a process of our
    thinking. It cannot be changed without changing our thinking.”</span>
    <span>
        by <small class="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
    </span>
    <div class="tags">
        Tags:
        <a class="tag" href="/tag/change/page/1/">change</a>
        <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
        <a class="tag" href="/tag/thinking/page/1/">thinking</a>
        <a class="tag" href="/tag/world/page/1/">world</a>
    </div>
</div>

Spider code:

import scrapy
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]
    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

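This spider extracts the text, author, and tags of every quote on the page; the yielded items appear in the log output in the terminal.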

4. Spider (version 4.0)

HTML of the next-page link:

<ul class="pager">
    <li class="next">
        <a href="/page/2/">Next <span aria-hidden="true">&rarr;</span></a>
    </li>
</ul>

Spider code:

import scrapy
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]
    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

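This spider extracts the text, author, and tags from each page, then extracts the URL of the next page and requests it with parse() as the callback again, repeating until there is no next-page link.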

# Quick test of how urljoin resolves relative links
from urllib.parse import urljoin
urljoin("http://www.xoxxoo.com/a/b/c.html", "d.html")
# Result: 'http://www.xoxxoo.com/a/b/d.html'
urljoin("http://www.xoxxoo.com/a/b/c.html", "/d.html")
# Result: 'http://www.xoxxoo.com/d.html'
urljoin("http://www.xoxxoo.com/a/b/c.html", "../d.html")
# Result: 'http://www.xoxxoo.com/a/d.html'
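The follow-the-next-link loop can also be simulated without any network access. In this sketch, NEXT_LINKS is made-up data standing in for the "Next" href found on each page, and crawl_order is a hypothetical helper; a real spider gets the same effect by yielding a new request from parse().

```python
from urllib.parse import urljoin

# Made-up site map: page URL -> href of its "Next" link (None on the last page).
NEXT_LINKS = {
    "http://quotes.toscrape.com/page/1/": "/page/2/",
    "http://quotes.toscrape.com/page/2/": "/page/3/",
    "http://quotes.toscrape.com/page/3/": None,
}


def crawl_order(start_url):
    """Return the pages in the order a next-link crawl would visit them."""
    visited = []
    url = start_url
    while url is not None:
        visited.append(url)
        href = NEXT_LINKS.get(url)
        # Resolve the relative href against the current page, then continue.
        url = urljoin(url, href) if href is not None else None
    return visited


print(crawl_order("http://quotes.toscrape.com/page/1/"))
```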

5. Spider (version 5.0)

import scrapy
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]
    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

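The difference between version 4.0 and version 5.0: scrapy.Request requires an absolute URL, whereas response.follow also accepts a relative URL, so no urljoin call is needed. response.follow only returns a Request instance; you still have to yield it.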

6. Spider (version 6.0)

import scrapy
class AuthorSpider(scrapy.Spider):
    name = 'author'
    start_urls = ['http://quotes.toscrape.com/']
    def parse(self, response):
        author_page_links = response.css('.author + a')
        yield from response.follow_all(author_page_links, self.parse_author)
        pagination_links = response.css('li.next a')
        yield from response.follow_all(pagination_links, self.parse)
    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).get(default='').strip()
        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }

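When response.follow or response.follow_all is given a selector for an <a> element, it uses the element's href attribute automatically. Passing response.css('li.next a') therefore has the same effect as extracting the URL first with response.css('li.next a::attr(href)').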

The difference between response.follow_all and response.follow: the former returns an iterable of Request instances, while the latter returns a single Request instance.

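By default, Scrapy filters out requests for URLs it has already visited, which avoids duplicate requests even when many quotes link to the same author page. This behavior can be configured through the DUPEFILTER_CLASS setting.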
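Conceptually, duplicate filtering comes down to remembering what has already been scheduled. The sketch below illustrates the idea with a plain set of URL strings; filter_duplicates is a hypothetical helper, and Scrapy's own filter compares request fingerprints rather than raw strings, so this is an illustration and not its implementation.

```python
def filter_duplicates(urls):
    """Yield each URL once, skipping any that was already seen."""
    seen = set()
    for url in urls:
        if url in seen:
            continue  # already scheduled, drop it
        seen.add(url)
        yield url


discovered = [
    "http://quotes.toscrape.com/author/Albert-Einstein",
    "http://quotes.toscrape.com/page/2/",
    # The same author page, linked again from another quote.
    "http://quotes.toscrape.com/author/Albert-Einstein",
]
print(list(filter_duplicates(discovered)))
```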

IV. Using spider arguments

import scrapy
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    def start_requests(self):
        url = 'http://quotes.toscrape.com/'
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + 'tag/' + tag
        yield scrapy.Request(url, self.parse)
    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

Run the following command:

scrapy crawl quotes -O quotes-humor.json -a tag=humor

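Arguments passed with -a are handed to the spider's __init__ method, which by default sets them as attributes of the spider. In this example, the tag argument is read with getattr() and used to build the start URL.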
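The mechanism can be sketched in plain Python. ArgumentHolder below is a simplified, hypothetical stand-in for a spider whose __init__ copies keyword arguments into attributes, and start_url repeats the URL logic from start_requests() above; neither is part of Scrapy.

```python
class ArgumentHolder:
    """Simplified stand-in for a spider: keyword arguments become attributes."""

    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)


def start_url(spider):
    """Build the first URL the same way start_requests() does above."""
    url = "http://quotes.toscrape.com/"
    # getattr with a default avoids an AttributeError when no tag was passed.
    tag = getattr(spider, "tag", None)
    if tag is not None:
        url = url + "tag/" + tag
    return url


print(start_url(ArgumentHolder(tag="humor")))  # as with -a tag=humor
print(start_url(ArgumentHolder()))  # no -a option given
```

Values passed with -a arrive as strings, so an argument meant as a number would need converting before use.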

V. CSS selectors and XPath

scrapy shell is an interactive shell for quickly debugging scraping code, especially data-extraction code (CSS selectors and XPath). Start it with:

scrapy shell "http://quotes.toscrape.com/page/1/"

Page source:

<head>
    <meta charset="UTF-8">
    <title>Quotes to Scrape</title>
    <link rel="stylesheet" href="/static/bootstrap.min.css">
    <link rel="stylesheet" href="/static/main.css">
</head>

Extracting data with CSS selectors

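# Get a list of all matching values
>>> response.css('title::text').getall()
# Get the first matching value, or None if nothing matches
>>> response.css('title::text').get()
# Get the first matching value; raises IndexError if nothing matches
>>> response.css('title::text')[0].get()
# Apply a regular expression to the selected text
>>> response.css('title::text').re(r'Quotes.*')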

Extracting data with XPath

>>> response.xpath('//title')
[<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]
>>> response.xpath('//title/text()').get()
'Quotes to Scrape'