CrawlSpider
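1. CrawlSpider inherits from scrapy.Spider.
2. CrawlSpider lets you define rules. While parsing the HTML it extracts the links that match those rules and then sends requests to them. This makes CrawlSpider a good fit whenever links must be followed, that is, whenever a crawled page yields further links that need to be crawled in turn.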
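Extracting links in the scrapy shell:
1. At the command prompt, run: scrapy shell https://www.dushu.com/lianzai/1115.html
2. Import the link extractor: from scrapy.linkextractors import LinkExtractor
3. allow = () : regular expression; links whose URL matches the pattern are extracted
4. View what the link extractor found by calling its extract_links(response) method
5. restrict_xpaths = () : XPath syntax; only links inside the regions selected by the XPath are extracted
   View the extracted content the same way.
6. restrict_css = () : CSS selector syntax; only links inside the regions selected by the selector are extracted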
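Small example:
1. Create the project: scrapy startproject project_name
2. Change to the spiders directory: cd .\project_name\project_name\spiders\
3. Create the spider class: scrapy genspider -t crawl spider_name url_to_crawl (note the -t crawl option, which differs from the earlier genspider usage)
4. Run it: scrapy crawl spider_name
In a Rule, write callback as a string containing the method name, for example "parse_item"
follow=True controls whether links are followed: pages reached through a rule have the same link-extraction rules applied to them again
Spider file: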
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from scrapy_readbook_20240120.items import ScrapyReadbook20240120Item


class RbookSpider(CrawlSpider):
    name = "rbook"
    allowed_domains = ["www.dushu.com"]
    start_urls = ["https://www.dushu.com/lianzai/1115_1.html"]

    rules = (
        # Extract the paginated list pages; the dot is escaped so it matches a literal "."
        Rule(
            LinkExtractor(allow=r"/lianzai/1115_\d+\.html"),
            callback="parse_item",
            follow=False,
        ),
    )

    def parse_item(self, response):
        print("++++++++++++++++++++")  # debug marker: one line per parsed page
        # Collect the title (alt) and image URL (data-original) of each cover image
        img_list = response.xpath("//div[@class='bookslist']//img")
        for img in img_list:
            src = img.xpath("./@data-original").extract_first()
            name = img.xpath("./@alt").extract_first()
            book = ScrapyReadbook20240120Item(name=name, src=src)
            yield book
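LinkExtractor searches each absolute URL for the `allow` pattern, so the pattern can be checked with the standard `re` module before the spider is run. The sketch below does that; the candidate URLs are examples chosen for illustration, and the last one is deliberately malformed to show why the dot should be escaped.

```python
import re

# The rule's pattern, written with an escaped dot so that only a literal "." matches.
ALLOW = re.compile(r"/lianzai/1115_\d+\.html")

candidates = [
    "https://www.dushu.com/lianzai/1115_2.html",   # paginated list page
    "https://www.dushu.com/lianzai/1115_13.html",  # multi-digit page number
    "https://www.dushu.com/lianzai/1115.html",     # no "_<number>" suffix
    "https://www.dushu.com/lianzai/1115_2xhtml",   # accepted only if the dot is left unescaped
]

# re.search finds the pattern anywhere in the URL.
matched = [url for url in candidates if ALLOW.search(url)]
for url in matched:
    print(url)
```

Only the first two URLs are printed. With an unescaped `.` the pattern would also accept the last candidate, because `.` matches any character.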
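pipelines.py file:
class ScrapyReadbook20240120Pipeline:
    def open_spider(self, spider):
        # Runs once when the spider starts: open the output file
        self.fp = open("book.json", "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Runs for every item the spider yields
        self.fp.write(str(item))
        return item

    def close_spider(self, spider):
        # Runs once when the spider finishes: close the file
        self.fp.close()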
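Writing `str(item)` stores each item's Python representation with no separator between items, so the resulting file is not valid JSON. One possible alternative, sketched here on the assumption that one JSON object per line is acceptable, serializes each item with `json.dumps`. The class name `JsonLinesPipeline`, the `path` parameter, and the file name `book.jsonl` are hypothetical; such a pipeline would also need its own entry in `ITEM_PIPELINES` before Scrapy would run it.

```python
import json


class JsonLinesPipeline:
    """Hypothetical pipeline that writes one JSON object per line."""

    def __init__(self, path="book.jsonl"):
        # The default lets Scrapy create the pipeline without arguments.
        self.path = path

    def open_spider(self, spider):
        self.fp = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # dict(item) works for scrapy.Item instances and for plain dicts.
        line = json.dumps(dict(item), ensure_ascii=False)
        self.fp.write(line + "\n")
        return item

    def close_spider(self, spider):
        self.fp.close()
```

`ensure_ascii=False` keeps Chinese book titles readable in the file instead of escaping them as `\uXXXX` sequences.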
items.py file:
import scrapy


class ScrapyReadbook20240120Item(scrapy.Item):
    # Fields stored for each book
    name = scrapy.Field()  # book title
    src = scrapy.Field()  # cover image URL
settings.py file:
# Enable the pipeline
ITEM_PIPELINES = {
    "scrapy_readbook_20240120.pipelines.ScrapyReadbook20240120Pipeline": 300,
}
Saving to a database:
1. Create the database
create database database_name charset utf8;
2. Select the database
use database_name;
3. Create the table, for example:
create table table_name(
    id int primary key auto_increment,
    name varchar(128),
    src varchar(128)
);
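4. In settings.py, add the host (IP address), port, user name, password, database name and character set
DB_HOST = "localhost"  # host / IP address
DB_PORT = 3306  # port; must be an integer
DB_USER = "root"  # database user name
DB_PASSWORD = "123456"  # database password
DB_NAME = "rbook"  # database name
DB_CHARSET = "utf8"  # character set; write it without a hyphen (utf8, not utf-8)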
5. Add the following to the pipelines file
# Load the project settings
from scrapy.utils.project import get_project_settings
import pymysql


class MysqlPipeline:
    def open_spider(self, spider):
        settings = get_project_settings()
        self.host = settings["DB_HOST"]  # host / IP address
        self.port = settings["DB_PORT"]  # port
        self.user = settings["DB_USER"]  # database user name
        self.password = settings["DB_PASSWORD"]  # database password
        self.name = settings["DB_NAME"]  # database name
        self.charset = settings["DB_CHARSET"]  # character set
        self.connect()

    def connect(self):
        self.conn = pymysql.connect(
            host=self.host,
            port=self.port,
            user=self.user,
            password=self.password,
            database=self.name,
            charset=self.charset,
        )
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # Pass the values as parameters so the driver escapes any quotes in them
        sql = "insert into rbook(name, src) values (%s, %s)"
        # Execute the SQL statement
        self.cursor.execute(sql, (item["name"], item["src"]))
        # Commit the transaction
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Close the database connection
        self.cursor.close()
        self.conn.close()
6. settings file: register the new pipeline
ITEM_PIPELINES = {
    "scrapy_readbook_20240120.pipelines.ScrapyReadbook20240120Pipeline": 300,
    "scrapy_readbook_20240120.pipelines.MysqlPipeline": 301,
}
7. To keep crawling until all of the data has been downloaded, set follow to True in the spider file
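An insert like the one in step 5 is safest when the values are passed to the driver as query parameters rather than formatted into the SQL string, because a book title containing a quote would otherwise break the statement. The sketch below illustrates the idea with the standard-library `sqlite3` module as a stand-in for MySQL, so that it runs without a database server. That substitution is an assumption made for the example, and `save_book` is a hypothetical helper. sqlite3 uses `?` placeholders, whereas pymysql uses `%s`.

```python
import sqlite3

# In-memory SQLite database, used only so the sketch runs without a MySQL server.
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute(
    "create table rbook (id integer primary key autoincrement, name text, src text)"
)


def save_book(name, src):
    """Insert one row, letting the driver quote the values."""
    # sqlite3 placeholder is "?"; with pymysql the same statement would use "%s".
    cursor.execute("insert into rbook (name, src) values (?, ?)", (name, src))
    conn.commit()


# A title containing a single quote would break a string-formatted statement.
save_book("It's a book", "https://example.com/cover.jpg")

rows = cursor.execute("select name, src from rbook").fetchall()
print(rows)
```

The title is stored unchanged, including its single quote, because the driver handles the quoting.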
Data in the database: