DEBUG: Ignoring response <403 https://digital.ucas.com/coursedisplay/results/courses?studyYear=2024>: HTTP status code is not handled or not allowed
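Cause: the site blocked the request, most likely because of Scrapy's default User-Agent. Set USER_AGENT in settings.py (any browser-like string will do):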
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36'
After that, the request goes through without any problem:
2023-10-31 13:41:56 [scrapy.utils.log] INFO: Scrapy 2.10.0 started (bot: ucas_under)
2023-10-31 13:41:56 [scrapy.utils.log] INFO: Versions: lxml 4.9.3.0, libxml2 2.10.3, cssselect 1.2.0, parsel 1.8.1, w3lib 2.1.1, Twisted 22.10.0, Python 3.9.16 (main, May 17 2023, 17:49:16) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 23.2.0 (OpenSSL 3.1.1 30 May 2023), cryptography 41.0.1, Platform Windows-10-10.0.19045-SP0
2023-10-31 13:41:56 [scrapy.addons] INFO: Enabled addons:
[]
2023-10-31 13:41:56 [scrapy.crawler] INFO: Overridden settings:
{'AUTOTHROTTLE_ENABLED': True,
 'BOT_NAME': 'ucas_under',
 'COOKIES_ENABLED': False,
 'DOWNLOAD_DELAY': 5,
 'FEED_EXPORT_ENCODING': 'utf-8',
 'LOG_FILE': 'log/ucas_under.log',
 'LOG_LEVEL': 'INFO',
 'NEWSPIDER_MODULE': 'ucas_under.spiders',
 'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
 'SPIDER_MODULES': ['ucas_under.spiders'],
 'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor',
 'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, '
               'like Gecko) Chrome/55.0.2883.87 Safari/537.36'}
2023-10-31 13:41:56 [scrapy.extensions.telnet] INFO: Telnet Password: d51ffe3ad1833b8d
2023-10-31 13:41:56 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.throttle.AutoThrottle']
2023-10-31 13:41:56 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2023-10-31 13:41:56 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2023-10-31 13:41:56 [scrapy.middleware] INFO: Enabled item pipelines:
['ucas_under.pipelines.UcasUnderPipeline']
2023-10-31 13:41:56 [scrapy.core.engine] INFO: Spider opened
2023-10-31 13:41:56 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-10-31 13:41:56 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2023-10-31 13:41:59 [scrapy.core.engine] INFO: Closing spider (finished)
2023-10-31 13:41:59 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 604,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 124343,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 2.85669,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2023, 10, 31, 5, 41, 59, 106002),
 'httpcompression/response_bytes': 544169,
 'httpcompression/response_count': 1,
 'log_count/INFO': 10,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2023, 10, 31, 5, 41, 56, 249312)}
2023-10-31 13:41:59 [scrapy.core.engine] INFO: Spider closed (finished)
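The stats dump is the quickest confirmation that the fix worked: one request, one response, status 200. As a small sanity check, the numbers can be recomputed with the standard library alone. The sketch below copies the values from the dump; the variable names are my own. It assumes that start_time and finish_time are in UTC (which would explain the 8-hour gap to the 13:41 log timestamps on a UTC+8 machine) and that httpcompression/response_bytes is the decompressed body size, so the ratio is only a rough indication of how well the response was compressed.

```python
from datetime import datetime

# Values copied from the stats dump above.
start_time = datetime(2023, 10, 31, 5, 41, 56, 249312)   # 'start_time'
finish_time = datetime(2023, 10, 31, 5, 41, 59, 106002)  # 'finish_time'
response_bytes = 124343       # 'downloader/response_bytes'
decompressed_bytes = 544169   # 'httpcompression/response_bytes'

# Elapsed wall-clock time; should agree with 'elapsed_time_seconds'.
elapsed = (finish_time - start_time).total_seconds()

# Rough size ratio between the decompressed body and the bytes received.
ratio = decompressed_bytes / response_bytes

print(f"elapsed: {elapsed:.5f} s")
print(f"decompressed/received: {ratio:.2f}x")
```

The computed elapsed time matches the reported elapsed_time_seconds of 2.85669, and the response comes out at roughly 4.4 times its transferred size after decompression.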