Official Documentation
Sample Code:
from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {
    }

    # minutes=24 * 60: re-crawl once a day
    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://scrapy.org/', callback=self.index_page)

    # age: expiration time of the task
    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }
1 Architecture
scheduler: the scheduler
fetcher: the fetcher, which requests web pages
processor: processes the fetched data; new urls it produces are fed back to the scheduler
monitor & webui: the monitor and the web UI
scheduler
Maintains the request queue; by default it is stored in a local database.
fetcher
Supports two modes: plain HTTP requests and JavaScript (rendered) requests.
processor
Processes the data and wraps new urls into requests that go back to the scheduler.
2 Command Line
Global Config
global options work for all subcommands.
Usage: pyspider [OPTIONS] COMMAND [ARGS]...
A powerful spider system in python.
Options:
-c, --config FILENAME a json file with default values for subcommands.
{"webui": {"port":5001}}
--logging-config TEXT logging config file for built-in python logging
module [default: pyspider/pyspider/logging.conf]
--debug debug mode
--queue-maxsize INTEGER maxsize of queue
--taskdb TEXT database url for taskdb, default: sqlite
--projectdb TEXT database url for projectdb, default: sqlite
--resultdb TEXT database url for resultdb, default: sqlite
--message-queue TEXT connection url to message queue, default: builtin
multiprocessing.Queue
--amqp-url TEXT [deprecated] amqp url for rabbitmq. please use
--message-queue instead.
--beanstalk TEXT [deprecated] beanstalk config for beanstalk queue.
please use --message-queue instead.
--phantomjs-proxy TEXT phantomjs proxy ip:port
--data-path TEXT data dir path
--version Show the version and exit.
--help Show this message and exit.
config
Configuration options used when starting pyspider.
{
  "taskdb": "mysql+taskdb://username:password@host:port/taskdb",
  "projectdb": "mysql+projectdb://username:password@host:port/projectdb",
  "resultdb": "mysql+resultdb://username:password@host:port/resultdb",
  "message_queue": "amqp://username:password@host:port/%2F",
  "webui": {
    "username": "some_name",
    "password": "some_passwd",
    "need-auth": true
  }
}
When the project starts, the three databases above are generated under the data directory of the current project.
webui: sets the username and password required to access the web interface; a login prompt will then pop up.
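For example, assuming the JSON above is saved as config.json (a file name chosen here for illustration), it can be passed to any subcommand via the global -c/--config option:

pyspider -c config.json all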
all
Runs all components.
Usage: pyspider all [OPTIONS]
Run all the components in subprocess or thread
Options:
--fetcher-num INTEGER instance num of fetcher
--processor-num INTEGER instance num of processor
--result-worker-num INTEGER instance num of result worker
--run-in [subprocess|thread] run each components in thread or subprocess.
always using thread for windows.
--help Show this message and exit.
one
Runs all components in a single process.
Usage: pyspider one [OPTIONS] [SCRIPTS]...
One mode not only means all-in-one, it runs every thing in one process
over tornado.ioloop, for debug purpose
Options:
-i, --interactive enable interactive mode, you can choose crawl url.
--phantomjs enable phantomjs, will spawn a subprocess for phantomjs
--help Show this message and exit.
bench
Benchmark test.
Usage: pyspider bench [OPTIONS]
Run Benchmark test. In bench mode, in-memory sqlite database is used
instead of on-disk sqlite database.
Options:
--fetcher-num INTEGER instance num of fetcher
--processor-num INTEGER instance num of processor
--result-worker-num INTEGER instance num of result worker
--run-in [subprocess|thread] run each components in thread or subprocess.
always using thread for windows.
--total INTEGER total url in test page
--show INTEGER show how many urls in a page
--help Show this message and exit.
Each of the following components can be started individually.
scheduler
Usage: pyspider scheduler [OPTIONS]
Run Scheduler, only one scheduler is allowed.
Options:
--xmlrpc / --no-xmlrpc
--xmlrpc-host TEXT
--xmlrpc-port INTEGER
--inqueue-limit INTEGER size limit of task queue for each project, tasks
will been ignored when overflow
--delete-time INTEGER delete time before marked as delete
--active-tasks INTEGER active log size
--loop-limit INTEGER maximum number of tasks due with in a loop
--scheduler-cls TEXT scheduler class to be used.
--help Show this message and exit.
phantomjs
Usage: run.py phantomjs [OPTIONS] [ARGS]...
Run phantomjs fetcher if phantomjs is installed.
Options:
--phantomjs-path TEXT phantomjs path
--port INTEGER phantomjs port
--auto-restart TEXT auto restart phantomjs if crashed
--help Show this message and exit.
fetcher
Usage: pyspider fetcher [OPTIONS]
Run Fetcher.
Options:
--xmlrpc / --no-xmlrpc
--xmlrpc-host TEXT
--xmlrpc-port INTEGER
--poolsize INTEGER max simultaneous fetches
--proxy TEXT proxy host:port
--user-agent TEXT user agent
--timeout TEXT default fetch timeout
--fetcher-cls TEXT Fetcher class to be used.
--help Show this message and exit.
processor
Usage: pyspider processor [OPTIONS]
Run Processor.
Options:
--processor-cls TEXT Processor class to be used.
--help Show this message and exit.
result_worker
Usage: pyspider result_worker [OPTIONS]
Run result worker.
Options:
--result-cls TEXT ResultWorker class to be used.
--help Show this message and exit.
webui
Usage: pyspider webui [OPTIONS]
Run WebUI
Options:
--host TEXT webui bind to host
--port INTEGER webui bind to port
--cdn TEXT js/css cdn server
--scheduler-rpc TEXT xmlrpc path of scheduler
--fetcher-rpc TEXT xmlrpc path of fetcher
--max-rate FLOAT max rate for each project
--max-burst FLOAT max burst for each project
--username TEXT username of locked projects
--password TEXT password of locked projects
--need-auth need username and password
--webui-instance TEXT webui Flask Application instance to be used.
--help Show this message and exit.
3 API Reference
self.crawl(url, **kwargs)
self.crawl is the main interface to tell pyspider which url(s) should be crawled.
The function used to issue crawl requests.
Parameters:
url
the url or url list to be crawled.
The web page(s) to request.
callback
the method to parse the response. default: __call__
The callback function.
def on_start(self):
    self.crawl('http://scrapy.org/', callback=self.index_page)
the following parameters are optional
age
the period of validity of the task. The page would be regarded as not modified during the period. default: -1(never recrawl)
Expiration time of the request, in seconds; the page is only re-requested after it expires.
@config(age=10 * 24 * 60 * 60)
def index_page(self, response):
    ...
Every page parsed by the callback index_page would be regarded as not changed within 10 days. If you submit the task within 10 days of the last crawl, it would be discarded.
priority
the priority of task to be scheduled, higher the better. default: 0
Specifies the crawl priority.
def index_page(self):
    self.crawl('http://www.example.org/page2.html', callback=self.index_page)
    self.crawl('http://www.example.org/233.html', callback=self.detail_page,
               priority=1)
The page 233.html would be crawled before page2.html. This parameter can be used to do a BFS and reduce the number of tasks in the queue (which may otherwise cost more memory).
exetime
the executed time of task in unix timestamp. default: 0(immediately)
Specifies the execution time.
import time

def on_start(self):
    self.crawl('http://www.example.org/', callback=self.callback,
               exetime=time.time()+30*60)
The page would be crawled 30 minutes later.
retries
retry times while failed. default: 3
Number of retries after a failure; the default is 3 and the maximum is 10.
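As a rough sketch (the URL and callback below are placeholders, not from the official docs), the retry count could be raised for an unreliable page:

def on_start(self):
    # retry up to 5 times instead of the default 3 if the fetch fails
    self.crawl('http://www.example.org/flaky-page', callback=self.detail_page,
               retries=5)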
itag
a marker from frontier page to reveal the potential modification of the task. It will be compared to its last value, recrawl when it’s changed. default: None
A marker used to detect whether the page has changed.
def index_page(self, response):
    for item in response.doc('.item').items():
        self.crawl(item.find('a').attr.url, callback=self.detail_page,
                   itag=item.find('.update-time').text())
In the sample, .update-time is used as itag. If it’s not changed, the request would be discarded.
If it has not changed, the page is not re-crawled.
Or you can use itag with Handler.crawl_config to specify the script version if you want to restart all of the tasks.
class Handler(BaseHandler):
    crawl_config = {
        'itag': 'v223'
    }
Change the value of itag after you modified the script and click run button again. It doesn’t matter if not set before.
auto_recrawl
when enabled, task would be recrawled every age time. default: False
Automatic re-crawl: if enabled, the task is re-crawled automatically after age expires.
def on_start(self):
    self.crawl('http://www.example.org/', callback=self.callback,
               age=5*60*60, auto_recrawl=True)
The page would be re-crawled every age, i.e. every 5 hours.
method
HTTP method to use. default: GET
The HTTP request method to use.
params
dictionary of URL parameters to append to the URL.
Query parameters for a GET request.
def on_start(self):
    self.crawl('http://httpbin.org/get', callback=self.callback,
               params={'a': 123, 'b': 'c'})
    self.crawl('http://httpbin.org/get?a=123&b=c', callback=self.callback)
The two requests are the same.
data
the body to attach to the request. If a dictionary is provided, form-encoding will take place.
The data body for a POST request.
def on_start(self):
    self.crawl('http://httpbin.org/post', callback=self.callback,
               method='POST', data={'a': 123, 'b': 'c'})
files
dictionary of {field: {filename: 'content'}} files to multipart upload.
Files to upload (multipart).
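A rough sketch of a multipart upload following the {field: {filename: 'content'}} format above; the field name, file name and target URL are made up for illustration:

def on_start(self):
    # upload a small in-memory text file as the form field 'upload'
    self.crawl('http://httpbin.org/post', callback=self.callback,
               method='POST',
               files={'upload': {'report.txt': 'hello from pyspider'}})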
user_agent
the User-Agent of the request
headers
dictionary of headers to send.
cookies
dictionary of cookies to attach to this request.
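A small sketch combining these three options; the header and cookie values below are arbitrary examples, not defaults:

def on_start(self):
    self.crawl('http://httpbin.org/get', callback=self.callback,
               user_agent='Mozilla/5.0 (compatible; MyCrawler/1.0)',
               headers={'Accept-Language': 'en-US,en;q=0.8'},
               cookies={'session': 'abc123'})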
connect_timeout
timeout for initial connection in seconds. default: 20
timeout
maximum time in seconds to fetch the page. default: 120
allow_redirects
follow 30x redirect default: True
Controls whether 30x (e.g. 302) redirects are followed.
validate_cert
For HTTPS requests, validate the server’s certificate? default: True
Whether to validate the server's HTTPS certificate (set to False to skip verification).
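A sketch of how these two options might be combined for a site with a self-signed certificate whose redirect you want to inspect yourself (the URL is hypothetical):

def on_start(self):
    # keep the original 30x response and skip certificate verification
    self.crawl('https://self-signed.example.org/', callback=self.callback,
               allow_redirects=False, validate_cert=False)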
proxy
proxy server of username:password@hostname:port to use, only http proxy is supported currently.
class Handler(BaseHandler):
    crawl_config = {
        'proxy': 'localhost:8080'
    }
Handler.crawl_config can be used with proxy to set a proxy for whole project.
Sets a proxy.
etag
use HTTP Etag mechanism to pass the process if the content of the page is not changed. default: True
Uses the ETag to determine whether the page has been updated and needs re-fetching.
last_modified
use HTTP Last-Modified header mechanism to pass the process if the content of the page is not changed. default: True
fetch_type
set to js to enable JavaScript fetcher. default: None
Set to 'js' to fetch with the JavaScript fetcher and parse the rendered page.
js_script
JavaScript run before or after the page is loaded; it should be wrapped in a function like function() { document.write("binux"); }.
A script to run after (or before) the page is loaded.
def on_start(self):
    self.crawl('http://www.example.org/', callback=self.callback,
               fetch_type='js', js_script='''
               function() {
                   window.scrollTo(0,document.body.scrollHeight);
                   return 123;
               }
               ''')
The script would scroll the page to bottom. The value returned in function could be captured via Response.js_script_result.
js_run_at
run JavaScript specified via js_script at document-start or document-end. default: document-end
Whether the js_script runs before (document-start) or after (document-end) the document is loaded.
js_viewport_width/js_viewport_height
set the size of the viewport for the JavaScript fetcher of the layout process.
The viewport size of the JS fetcher.
load_images
load images when JavaScript fetcher enabled. default: False
Whether to load images when the JavaScript fetcher is enabled.
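A sketch combining the JavaScript-fetcher options above; the viewport values are arbitrary choices, not defaults:

def on_start(self):
    self.crawl('http://www.example.org/', callback=self.callback,
               fetch_type='js',
               js_viewport_width=1024, js_viewport_height=768,
               load_images=True)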
save
an object passed to the callback method, which can be accessed via response.save.
A parameter used to pass variables between callbacks.
def on_start(self):
    self.crawl('http://www.example.org/', callback=self.callback,
               save={'a': 123})

def callback(self, response):
    return response.save['a']
123 would be returned in callback
taskid
unique id to identify the task, default is the MD5 check code of the URL, can be overridden by method def get_taskid(self, task)
Specifies the task's unique id, used to de-duplicate messages in the request queue.
import json
from pyspider.libs.utils import md5string

def get_taskid(self, task):
    return md5string(task['url']+json.dumps(task['fetch'].get('data', '')))
By default only the url is md5-ed as the taskid; the code above adds the data of a POST request as part of the taskid.
force_update
force update task params even if the task is in ACTIVE status.
Forces an update of the task parameters.
cancel
cancel a task, should be used with force_update to cancel an active task. To cancel an auto_recrawl task, you should set auto_recrawl=False as well.
Cancels a task.
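A sketch of cancelling a previously submitted auto_recrawl task, following the rule above (the URL and callback are assumed to match the original task):

def on_start(self):
    # re-submit the same task: force_update overrides its ACTIVE state,
    # auto_recrawl=False stops the periodic re-crawl, cancel=True removes it
    self.crawl('http://www.example.org/', callback=self.callback,
               force_update=True, auto_recrawl=False, cancel=True)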
@config(**kwargs)
default parameters of self.crawl when the decorated method is used as the callback. For example:
Default parameters can be passed via @config.
@config(age=15*60)
def index_page(self, response):
    self.crawl('http://www.example.org/list-1.html', callback=self.index_page)
    self.crawl('http://www.example.org/product-233', callback=self.detail_page)

@config(age=10*24*60*60)
def detail_page(self, response):
    return {...}
The age of list-1.html is 15 minutes, while the age of product-233 is 10 days, because the callback of product-233 is detail_page, so it shares the config of detail_page.
Handler.crawl_config = {}
default parameters of self.crawl for the whole project. The parameters in crawl_config for scheduler (priority, retries, exetime, age, itag, force_update, auto_recrawl, cancel) will be joined when the task created, the parameters for fetcher and processor will be joined when executed. You can use this mechanism to change the fetch config (e.g. cookies) afterwards.
Put settings here to make them take effect for the whole project.
class Handler(BaseHandler):
    crawl_config = {
        'headers': {
            'User-Agent': 'GoogleBot',
        }
    }

    ...
Response
Response.url
final URL.
Response.text
Content of response, in unicode.
The page source as text.
if Response.encoding is None and chardet module is available, encoding of content will be guessed.
Response.content
Content of response, in bytes.
The page source as bytes.
Response.doc
A PyQuery object of the response's content. Links are made absolute by default.
Parse the page with the PyQuery library.
Refer to the documentation of PyQuery: https://pythonhosted.org/pyquery/
It is important, so I will repeat it: refer to the documentation of PyQuery: https://pythonhosted.org/pyquery/
Response.etree
A lxml object of the response’s content.
Returns an lxml object.
Response.json
The JSON-encoded content of the response, if any.
The response parsed as JSON.
Response.status_code
The HTTP status code: 200, 300, and so on.
Response.orig_url
If there is any redirection during the request, here is the url you just submit via self.crawl.
If there was a redirect, this is the original url that was submitted.
Response.headers
A case-insensitive dict holding the headers of the response.
Response.cookies
Response.error
Error messages when the fetch fails.
Response.time
Time used during fetching.
Response.ok
True if status_code is 200 and no error.
Tells whether the request succeeded.
Response.encoding
Encoding of Response.content.
Used to specify the encoding.
If Response.encoding is None, encoding will be guessed by header or content or chardet(if available).
Setting the encoding of the content manually overrides the guessed encoding.
Response.save
The object saved by self.crawl API
Used to pass parameters between different callbacks.
Response.js_script_result
content returned by JS script
Receives the result of the JS script.
Response.raise_for_status()
Raise HTTPError if status code is not 200 or Response.error exists.
Raises an error on failure.
self.send_message
Sends a message to another project via the message queue.
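A rough sketch of sending and receiving a message, assuming the receiving project implements the on_message(project, msg) hook from the official API; the project name 'other_project' is a placeholder:

def detail_page(self, response):
    # push the parsed result to the queue of another project
    self.send_message('other_project', {
        'url': response.url,
        'title': response.doc('title').text(),
    })

def on_message(self, project, msg):
    # called in the receiving project when a message arrives
    return msg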
@every(minutes=0, seconds=0)
Commonly used for periodic (scheduled) crawling.
The method will be called every minutes or seconds.
@every(minutes=24 * 60)
def on_start(self):
    for url in urllist:
        self.crawl(url, callback=self.index_page)
Executed once a day.
The urls would be restarted every 24 hours. Note that, if age is also used and the period is longer than @every, the crawl request would be discarded as it is regarded as not changed:
@every(minutes=24 * 60)
def on_start(self):
    self.crawl('http://www.example.org/', callback=self.index_page)

@config(age=10 * 24 * 60 * 60)
def index_page(self):
    ...
Even though the crawl request is triggered every day, it is discarded and the page is only actually re-crawled every 10 days.
4 Frequently Asked Questions
Unreadable Code (乱码) Returned from Phantomjs
Phantomjs doesn’t support gzip, don’t set Accept-Encoding header with gzip.
How to Delete a Project?
Set group to delete and status to STOP, then wait 24 hours. You can change the time before a project is deleted via scheduler.DELETE_TIME (or the --delete-time option of the scheduler subcommand).
How to Restart a Project?
Why
This happens after you have modified a script and want to crawl everything again with the new strategy, but since the age of the urls has not expired, the scheduler discards all of the new requests.
Solution
1. Create a new project.
2. Use an itag within Handler.crawl_config to specify the version of your script.
What does the progress bar mean on the dashboard?
When you move the mouse over a progress bar, you can see an explanation.
For 5m, 1h, 1d the numbers are the events triggered in the last 5 minutes, 1 hour, 1 day. For the all progress bar, they are the total number of tasks in the corresponding status.
Only the tasks in DEBUG/RUNNING status will show the progress.