Scraping famous quotes: http://quotes.toscrape.com/
1 Create the crawler project. In a terminal, run:
scrapy startproject quotes
2 Once the project is created, add a spider file named quotes.py to the spiders folder, with the following content:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class Quotes(CrawlSpider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ['http://quotes.toscrape.com/']

    rules = (
        # Follow pagination links and parse the quotes on each page.
        Rule(LinkExtractor(allow=r'/page/\d+'), callback='parse_quotes', follow=True),
        # Parse author detail pages; links found there are not followed.
        Rule(LinkExtractor(allow=r'/author/\w+'), callback='parse_author'),
    )

    def parse_quotes(self, response):
        # Each quote sits in an element with class "quote" (note the leading dot).
        for quote in response.css('.quote'):
            yield {
                'content': quote.css('.text::text').extract_first(),
                'author': quote.css('.author::text').extract_first(),
                # A quote can carry several tags, so collect all of them.
                'tags': quote.css('.tag::text').extract(),
            }

    def parse_author(self, response):
        name = response.css('.author-title::text').extract_first()
        author_born_date = response.css('.author-born-date::text').extract_first()
        author_born_location = response.css('.author-born-location::text').extract_first()
        author_description = response.css('.author-description::text').extract_first()
        return {
            'name': name,
            'author_born_date': author_born_date,
            'author_born_location': author_born_location,
            'author_description': author_description,
        }
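To see which links the two rules would pick up, the patterns can be tried on a few URLs with the standard re module. This is only a rough sketch: it assumes that LinkExtractor applies each allow pattern as a regular-expression search against the absolute URL, and that a link claimed by an earlier rule is not passed to a later one. The helper classify and the sample URLs are made up for illustration; they are not part of Scrapy or of the spider.

```python
import re

# The same patterns that appear in the spider's rules.
PAGE_RE = re.compile(r'/page/\d+')
AUTHOR_RE = re.compile(r'/author/\w+')


def classify(url):
    """Return the callback name of the first matching rule, or None.

    Hypothetical helper: it only mimics the order in which the rules
    are checked; it does not call Scrapy.
    """
    if PAGE_RE.search(url):
        return 'parse_quotes'
    if AUTHOR_RE.search(url):
        return 'parse_author'
    return None


# Illustrative URLs (assumed shapes, not fetched from the site).
samples = [
    'http://quotes.toscrape.com/page/2/',
    'http://quotes.toscrape.com/author/Albert-Einstein',
    'http://quotes.toscrape.com/tag/love/page/1/',
    'http://quotes.toscrape.com/login',
]

for url in samples:
    print(url, '->', classify(url))
```

One thing the sketch makes visible: because the pattern is searched anywhere in the URL, a paginated tag URL such as the third sample also matches /page/\d+. If such links exist on the site, the same quote could be yielded more than once; anchoring the pattern more tightly would be one way to avoid that.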
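The directory structure is as follows:
3 Run the crawler
Run scrapy crawl quotes in a terminal; the result is shown in the figure:
With that, a simple crawler is complete.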