I. Install the Scrapy package
pip install Scrapy
II. Create a Scrapy project (tutorial)
scrapy startproject tutorial
The project directory contains the following:
tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py
III. Write the spiders in the tutorial/spiders directory (quotes_spider.py)
1. Spider (version 1.0)
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # The second-to-last URL segment is the page number.
        page = response.url.split("/")[-2]
        filename = f'quotes-{page}.html'
        # Save the raw response body to a local file.
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log(f'Saved file {filename}')
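The spider class QuotesSpider must subclass scrapy.Spider and define the following attributes and methods:
name: the spider's name (it must be unique within a project)
start_requests(): returns an iterable of requests (a list or a generator) from which the spider starts crawling
Request: scrapy.Request(url=url, callback=self.parse)
parse(): the callback that handles the response to each request (the response argument is a TextResponse instance that holds the page content)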
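The file name used in parse() comes from splitting the URL on "/" and taking the second-to-last segment. The sketch below reproduces that step without Scrapy; page_filename is a hypothetical helper name introduced only for illustration.

```python
def page_filename(url):
    """Build the output file name the same way parse() does."""
    # 'http://quotes.toscrape.com/page/1/'.split("/") ends with
    # [..., 'page', '1', ''], so index -2 is the page number.
    page = url.split("/")[-2]
    return f'quotes-{page}.html'


for url in ('http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/'):
    print(url, '->', page_filename(url))
```

Note that this relies on the trailing slash: without it, index -2 would be 'page' rather than the page number.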
2. Spider (version 2.0)
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f'quotes-{page}.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
Defining the start_urls attribute is enough; there is no need to override start_requests(). The default start_requests() implementation creates the initial requests from start_urls, and parse() is used as their callback.
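To make the relationship between version 1.0 and version 2.0 concrete, here is a simplified model of what a default start_requests() does with start_urls. It is not Scrapy's actual implementation: RequestStandIn and SpiderStandIn are hypothetical stand-ins so that the sketch runs without Scrapy installed.

```python
from typing import Callable, Iterator, NamedTuple


class RequestStandIn(NamedTuple):
    """Hypothetical stand-in for scrapy.Request, holding only two fields."""
    url: str
    callback: Callable


class SpiderStandIn:
    """Hypothetical stand-in for scrapy.Spider."""
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def start_requests(self) -> Iterator[RequestStandIn]:
        # One request per entry in start_urls, each handled by parse().
        for url in self.start_urls:
            yield RequestStandIn(url=url, callback=self.parse)

    def parse(self, response):
        pass  # a real spider would extract data here


spider = SpiderStandIn()
requests = list(spider.start_requests())
for request in requests:
    print(request.url, request.callback.__name__)
```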
3. Spider (version 3.0)
Page HTML:
<div class="quote">
    <span class="text">“The world as we have created it is a process of our
    thinking. It cannot be changed without changing our thinking.”</span>
    <span>
        by <small class="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
    </span>
    <div class="tags">
        Tags:
        <a class="tag" href="/tag/change/page/1/">change</a>
        <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
        <a class="tag" href="/tag/thinking/page/1/">thinking</a>
        <a class="tag" href="/tag/world/page/1/">world</a>
    </div>
</div>
Spider code:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        # Each div.quote block holds one quote.
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
This spider extracts the text, author and tags of every quote on the page and yields them as items, which Scrapy prints in the terminal log.
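To show the shape of one yielded item without running Scrapy, the sketch below pulls the same three fields out of a shortened copy of the HTML above using the standard library's html.parser. This is a swapped-in technique for illustration only, not how Scrapy's CSS selectors work; SAMPLE and QuoteParser are hypothetical names.

```python
from html.parser import HTMLParser

# Shortened copy of the quote block shown above.
SAMPLE = """
<div class="quote">
  <span class="text">“The world as we have created it.”</span>
  <span>by <small class="author">Albert Einstein</small></span>
  <div class="tags">
    <a class="tag" href="/tag/change/page/1/">change</a>
    <a class="tag" href="/tag/world/page/1/">world</a>
  </div>
</div>
"""


class QuoteParser(HTMLParser):
    """Collects text, author and tags from a single quote block."""

    def __init__(self):
        super().__init__()
        self.item = {'text': None, 'author': None, 'tags': []}
        self._field = None  # field that the next text node belongs to

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get('class')
        if tag == 'span' and css_class == 'text':
            self._field = 'text'
        elif tag == 'small' and css_class == 'author':
            self._field = 'author'
        elif tag == 'a' and css_class == 'tag':
            self._field = 'tags'

    def handle_data(self, data):
        if self._field == 'tags':
            self.item['tags'].append(data)
        elif self._field is not None:
            self.item[self._field] = data
        self._field = None  # only the text directly after the tag counts


parser = QuoteParser()
parser.feed(SAMPLE)
item = parser.item
print(item)
```

The resulting dictionary has the same keys as the item yielded by parse(): a single string for text and author, and a list for tags, mirroring get() versus getall().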
4. Spider (version 4.0)
HTML of the next-page link:
<ul class="pager">
    <li class="next">
        <a href="/page/2/">Next <span aria-hidden="true">→</span></a>
    </li>
</ul>
Spider code:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        # Follow the "Next" link, if there is one.
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            # Turn the relative href into an absolute URL.
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
This spider extracts the text, author and tags, which appear in the terminal log. It then extracts the address of the next page and requests it with parse() as the callback again, repeating until no next-page link is found.
# Quick check of how urljoin resolves relative URLs
from urllib.parse import urljoin
urljoin("http://www.xoxxoo.com/a/b/c.html", "d.html")
# Result: 'http://www.xoxxoo.com/a/b/d.html'
urljoin("http://www.xoxxoo.com/a/b/c.html", "/d.html")
# Result: 'http://www.xoxxoo.com/d.html'
urljoin("http://www.xoxxoo.com/a/b/c.html", "../d.html")
# Result: 'http://www.xoxxoo.com/a/d.html'
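Applied to this tutorial's own pagination link, the same resolution looks like the sketch below. It assumes that response.urljoin(next_page) behaves like the standard library's urljoin(response.url, next_page), which holds when the page declares no <base> element; current_page and next_href are illustrative names.

```python
from urllib.parse import urljoin

current_page = 'http://quotes.toscrape.com/page/1/'
next_href = '/page/2/'  # value of li.next a::attr(href) in the HTML above

# A leading "/" makes the href relative to the site root.
next_url = urljoin(current_page, next_href)
print(next_url)  # http://quotes.toscrape.com/page/2/
```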
5. Spider (version 5.0)
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            # response.follow accepts the relative href directly.
            yield response.follow(next_page, callback=self.parse)
Difference between version 4.0 and version 5.0: scrapy.Request requires an absolute URL, whereas response.follow also accepts a relative URL (so the urljoin call is no longer needed) and returns a Request instance.
6. Spider (version 6.0)
import scrapy


class AuthorSpider(scrapy.Spider):
    name = 'author'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Follow every "(about)" link that sits next to an author name.
        author_page_links = response.css('.author + a')
        yield from response.follow_all(author_page_links, self.parse_author)

        # Follow the pagination link and parse the next page the same way.
        pagination_links = response.css('li.next a')
        yield from response.follow_all(pagination_links, self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            # default='' avoids calling strip() on None when nothing matches.
            return response.css(query).get(default='').strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }
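For <a> elements, response.follow and response.follow_all use the href attribute automatically, so a selector such as response.css('li.next a') can be passed to them in place of response.css('li.next a::attr(href)'). The two selectors are not equivalent on their own: the first selects the <a> elements, the second selects their href values.
The difference between response.follow_all and response.follow: the former returns an iterable of Request instances, while the latter returns a single Request instance.
By default, Scrapy filters out requests for URLs it has already visited, which avoids duplicate requests (for example, to an author page linked from several quotes). This behaviour can be configured through the DUPEFILTER_CLASS setting.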
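The effect of duplicate filtering can be pictured with a toy crawl loop that remembers every URL it has already scheduled. This is only a simplified model: Scrapy's real filter works on request fingerprints rather than a plain set of URLs, and SITE and crawl are hypothetical names with made-up link data.

```python
from urllib.parse import urljoin

# Hypothetical in-memory "site": page URL -> hrefs found on that page.
SITE = {
    'http://quotes.toscrape.com/': ['/author/Albert-Einstein', '/page/2/'],
    'http://quotes.toscrape.com/page/2/': ['/author/Albert-Einstein', '/'],
    'http://quotes.toscrape.com/author/Albert-Einstein': [],
}


def crawl(start_url):
    """Visit every reachable page once, in scheduling order."""
    seen = {start_url}   # URLs that have already been scheduled
    queue = [start_url]  # pending requests, first in first out
    visited = []
    while queue:
        url = queue.pop(0)
        visited.append(url)
        for href in SITE.get(url, []):
            target = urljoin(url, href)
            if target in seen:
                continue  # duplicate request, dropped
            seen.add(target)
            queue.append(target)
    return visited


order = crawl('http://quotes.toscrape.com/')
print(order)
```

The author page is linked from both listing pages, and page 2 links back to the start page, yet each URL is visited exactly once.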
IV. Using spider arguments
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        url = 'http://quotes.toscrape.com/'
        # "tag" is set only when the spider is run with -a tag=...
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + 'tag/' + tag
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
Run it from the command line (-O writes the scraped items to quotes-humor.json, overwriting any existing file):
scrapy crawl quotes -O quotes-humor.json -a tag=humor
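Arguments passed with -a are handed to the spider's __init__ method, which by default sets them as attributes of the spider. In this example the argument is read with getattr() and used to build the start URL.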
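The sketch below imitates that mechanism without Scrapy: keyword arguments become instance attributes, and getattr() falls back to None when no tag was given. ArgSpiderStandIn and start_url are hypothetical names used only for illustration; they are not part of Scrapy.

```python
class ArgSpiderStandIn:
    """Hypothetical stand-in for the way a spider stores its -a arguments."""

    def __init__(self, **kwargs):
        # Every keyword argument becomes an attribute of the instance.
        self.__dict__.update(kwargs)

    def start_url(self):
        url = 'http://quotes.toscrape.com/'
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + 'tag/' + tag
        return url


print(ArgSpiderStandIn(tag='humor').start_url())
print(ArgSpiderStandIn().start_url())
```

Values passed with -a always arrive as strings, so a numeric argument would need an explicit conversion before use.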
V. CSS selectors and XPath
scrapy shell is an interactive shell for quickly debugging scraping code, especially data-extraction code (CSS selectors and XPath). Start it with:
scrapy shell "http://quotes.toscrape.com/page/1/"
Page source:
<head>
    <meta charset="UTF-8">
    <title>Quotes to Scrape</title>
    <link rel="stylesheet" href="/static/bootstrap.min.css">
    <link rel="stylesheet" href="/static/main.css">
</head>
CSS selector extraction
# Get a list of all matching data
>>> response.css('title::text').getall()
# Get the first match; returns None if nothing matches
>>> response.css('title::text').get()
# Get the first match; raises IndexError if nothing matches
>>> response.css('title::text')[0].get()
# Apply a regular expression to the selected data
>>> response.css('title::text').re(r'Quotes.*')
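The .re() call applies the pattern to the strings the selector has extracted. As a rough illustration outside the Scrapy shell, the same pattern can be tried on the title text with the standard library's re module; this assumes, for patterns without capture groups, that the results line up with re.findall, and title_text is an illustrative name.

```python
import re

title_text = 'Quotes to Scrape'  # text of the <title> element shown above

whole = re.findall(r'Quotes.*', title_text)  # greedy match to the end
first_word = re.findall(r'Q\w+', title_text)  # only the first word

print(whole)       # ['Quotes to Scrape']
print(first_word)  # ['Quotes']
```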
XPath extraction
>>> response.xpath('//title')
[<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]
>>> response.xpath('//title/text()').get()
'Quotes to Scrape'