JD.com Product Information Crawler: Strategy and Practice

news2024/12/26 20:59:30

Exploring JD.com

JD.com Case Study

Goal: scrape the first three pages of JD.com product data, using coroutines.

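Approach:

  1. When scraping a dynamic site, first analyze the API URL and compare several requests to see which parameters must change and which can stay fixed.
    • Principle: keep the request as close to the original URL as possible. Even if the request still succeeds without a given parameter, include it whenever you can, because the server may detect the difference and poison the response (it does not ban your IP; it feeds you wrong data instead).
  2. Build the headers
    • Commonly used ones are Referer (the page the request came from), Origin (the scheme and host of the source site), authority (the host being requested), Cookie (identity and session state), and User-Agent (a self-introduction: browser and machine information).
  3. If there are no encrypted parameters, the steps above are enough for the request to succeed.
  4. Parse the data with XPath
    • Note: the data returned by an API
      • may be an HTML fragment (XPath expressions written against the full page will not work)
      • may be JSON (no XPath needed; read the fields directly)
  5. Save to the database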
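Step 1 can be done mechanically rather than by eye. The following is a minimal sketch, not part of the original crawler: the helper name `diff_query_params` is hypothetical, and the two URLs are shortened, made-up stand-ins that only mimic the shape of the captured requests. It parses two request URLs with the standard library and reports the query parameters whose values differ.

```python
from urllib.parse import urlsplit, parse_qsl


def diff_query_params(url_a: str, url_b: str) -> dict:
    """Return {name: (value_in_a, value_in_b)} for every query parameter that differs."""
    params_a = dict(parse_qsl(urlsplit(url_a).query, keep_blank_values=True))
    params_b = dict(parse_qsl(urlsplit(url_b).query, keep_blank_values=True))
    changed = {}
    for name in sorted(params_a.keys() | params_b.keys()):
        if params_a.get(name) != params_b.get(name):
            changed[name] = (params_a.get(name), params_b.get(name))
    return changed


# Shortened, made-up URLs (assumption: real requests carry many more parameters)
page2 = 'https://api.example.com/?appid=search-pc-java&t=1720405956855&page=2&s=27'
page3 = 'https://api.example.com/?appid=search-pc-java&t=1720407258079&page=3&s=57'

for name, (old, new) in diff_query_params(page2, page3).items():
    print(f'{name}: {old} -> {new}')
```

Parameters that never show up in the diff are candidates for being copied verbatim; the ones that do show up are the ones the crawler has to generate per request.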

Source code:

# Request URLs
# page2: https://api.m.jd.com/?appid=search-pc-java&functionId=pc_search_s_new&client=pc&clientVersion=1.0.0&t=1720405956855&body=%7B%22keyword%22%3A%22%E5%A4%A7%E5%9C%B0%E7%93%9C%22%2C%22qrst%22%3A%221%22%2C%22wq%22%3A%22%E5%A4%A7%E5%9C%B0%E7%93%9C%22%2C%22stock%22%3A%221%22%2C%22pvid%22%3A%2231f843077aa2407b95785a72340fe7e7%22%2C%22page%22%3A%222%22%2C%22s%22%3A%2227%22%2C%22scrolling%22%3A%22y%22%2C%22log_id%22%3A%221720405812225.4544%22%2C%22tpl%22%3A%221_M%22%2C%22isList%22%3A0%2C%22show_items%22%3A%22%22%7D&loginType=3&uuid=143920055.1720405515970186801854.1720405516.1720405516.1720405516.1&area=5_148_0_0&h5st=20240708103236857%3B5t9mgnyi65z65t97%3Bf06cc%3Btk03wb0d01bf518n32iEIGovakaKKTQNowDzR6awWYSGFUuBxwxdqI7rI5zMMEIKtbV1MlLFKx401IvhD1hSA4wpk2GV%3Bd9b05f319cfb3214b43322b0819b6e4f6400b1a94fa5f756059946f426746e39%3B4.7%3B1720405956857%3BVSTdEx_T1kXZarkswckNeGNuhxfvcx7qwVbjbfZbJCw-qtP330LH_RLHo3Rkwl9CoWL9PUmTlTAUXvDuzdc5HevP3O0_FOX9xfOSI0mNNg4T_W44FIX_aU0T-VVaCT05fg_EXF-1PNNZBqaXHYx0BYl8zZ19Td0vLFIjaL01RMbD07ZObz8tLp2Jn3DvJml1suhfTZ89y7tuV7ItBhR7lbKKWByWlB8XRWGSg9XsxxB0HAZMcIUdNrylVgeML9uj9b5qTJ29X23MbI-a8LskjhzRdae8sApBdXTmGY5EmOB90339yeyK6rAKBoUnwtPtsr41wX1NpBGgHFQ9KUYCwLL8fY_p5xIpUsrVxOLCu7nZggE7nDk8PeheJO0dl8zjLad9Prk3hGJ0DQIeqffFGvzEemLTD52YgeDqWQHLXbk3&x-api-eid-token=jdd03V5VWE343CHG6ZQ5SWNXMF26W56JIB22JTI3USWGZ62J6GCA43ERHGX2X3VC7A2CC52WLPNXLZGK6XH6BBTWT7G3ET4AAAAMQSAV7CNYAAAAADCQFPXFHNCSEDUX
# page3: https://api.m.jd.com/?appid=search-pc-java&functionId=pc_search_s_new&client=pc&clientVersion=1.0.0&t=1720407258079&body=%7B%22keyword%22%3A%22%E5%A4%A7%E5%9C%B0%E7%93%9C%22%2C%22qrst%22%3A%221%22%2C%22wq%22%3A%22%E5%A4%A7%E5%9C%B0%E7%93%9C%22%2C%22stock%22%3A%221%22%2C%22pvid%22%3A%2231f843077aa2407b95785a72340fe7e7%22%2C%22isList%22%3A0%2C%22page%22%3A%223%22%2C%22s%22%3A%2257%22%2C%22click%22%3A%220%22%2C%22log_id%22%3A%221720405955645.6707%22%2C%22show_items%22%3A%22%22%7D&loginType=3&uuid=143920055.1720405515970186801854.1720405516.1720405516.1720405516.1&area=5_148_0_0&h5st=20240708105418081%3B5t9mgnyi65z65t97%3Bf06cc%3Btk03wb0d01bf518n32iEIGovakaKKTQNowDzR6awWYSGFUuBxwxdqI7rI5zMMEIKtbV1MlLFKx401IvhD1hSA4wpk2GV%3B5eb4d823b3d8dd2ff5859add39c06b452fb6dbcb07647f9b117418f6c7671564%3B4.7%3B1720407258081%3BVe3XdZE4uu81PFR3UERrJ8gt477o6yrRoFtAToB_YYSYld0tTDkgFG8h888NDDAnXYAup31SarNKXQ0_y4sWdI533d6lNV-w7tIjZOWlBOcz_W44FIX_aU0T-VVaCT05fg_EXF-1PNNZBqaXHYx0BYl8zZ19Td0vLFIjaL01RMbD07ZObz8tLp2Jn3DvJml1suhfTZ89y7tuV7ItBhR7lbKKWByWlB8XRWGSg9XsxxB0HAZMcIUdNrylVgeML9uj9b5qTJ29X23MbI-a8LskjhzRdae8sApBdXTmGY5EmOB90339yeyK6rAKBoUnwtPtsr41wX1NpBGgHFQ9KUYCwLL8fY_p5xIpUsrVxOLCu7nZggE7nDk8PeheJO0dl8zjLad9Prk3hGJ0DQIeqffFGvzEemLTD52YgeDqWQHLXbk3&x-api-eid-token=jdd03V5VWE343CHG6ZQ5SWNXMF26W56JIB22JTI3USWGZ62J6GCA43ERHGX2X3VC7A2CC52WLPNXLZGK6XH6BBTWT7G3ET4AAAAMQSAV7CNYAAAAADCQFPXFHNCSEDUX
# page4: https://api.m.jd.com/?appid=search-pc-java&functionId=pc_search_s_new&client=pc&clientVersion=1.0.0&t=1720407349236&body=%7B%22keyword%22%3A%22%E5%A4%A7%E5%9C%B0%E7%93%9C%22%2C%22qrst%22%3A%221%22%2C%22wq%22%3A%22%E5%A4%A7%E5%9C%B0%E7%93%9C%22%2C%22stock%22%3A%221%22%2C%22pvid%22%3A%2231f843077aa2407b95785a72340fe7e7%22%2C%22page%22%3A%224%22%2C%22s%22%3A%2286%22%2C%22scrolling%22%3A%22y%22%2C%22log_id%22%3A%221720407256984.9129%22%2C%22tpl%22%3A%221_M%22%2C%22isList%22%3A0%2C%22show_items%22%3A%22%22%7D&loginType=3&uuid=143920055.1720405515970186801854.1720405516.1720405516.1720405516.1&area=5_148_0_0&h5st=20240708105549238%3B5t9mgnyi65z65t97%3Bf06cc%3Btk03wb0d01bf518n32iEIGovakaKKTQNowDzR6awWYSGFUuBxwxdqI7rI5zMMEIKtbV1MlLFKx401IvhD1hSA4wpk2GV%3B53a1df01312ad6e415be12f3c271e5fde0371759f24f6df9d931cb7400793dda%3B4.7%3B1720407349238%3BVW26R-pcaaZF5vMsSl1AT2KBeBy2Mrpwr3LyiC87YKTO9t9-5eogEsbfN2l2-FvE8ladWvhvaXYyZJgUs5nY8C52izSYXnyiFNimDBz0Y0j8_W44FIX_aU0T-VVaCT05fg_EXF-1PNNZBqaXHYx0BYl8zZ19Td0vLFIjaL01RMbD07ZObz8tLp2Jn3DvJml1suhfTZ89y7tuV7ItBhR7lbKKWByWlB8XRWGSg9XsxxB0HAZMcIUdNrylVgeML9uj9b5qTJ29X23MbI-a8LskjhzRdae8sApBdXTmGY5EmOB90339yeyK6rAKBoUnwtPtsr41wX1NpBGgHFQ9KUYCwLL8fY_p5xIpUsrVxOLCu7nZggE7nDk8PeheJO0dl8zjLad9Prk3hGJ0DQIeqffFGvzEemLTD52YgeDqWQHLXbk3&x-api-eid-token=jdd03V5VWE343CHG6ZQ5SWNXMF26W56JIB22JTI3USWGZ62J6GCA43ERHGX2X3VC7A2CC52WLPNXLZGK6XH6BBTWT7G3ET4AAAAMQSAV7CNYAAAAADCQFPXFHNCSEDUX
# page5: https://api.m.jd.com/?appid=search-pc-java&functionId=pc_search_s_new&client=pc&clientVersion=1.0.0&t=1720407374827&body=%7B%22keyword%22%3A%22%E5%A4%A7%E5%9C%B0%E7%93%9C%22%2C%22qrst%22%3A%221%22%2C%22wq%22%3A%22%E5%A4%A7%E5%9C%B0%E7%93%9C%22%2C%22stock%22%3A%221%22%2C%22pvid%22%3A%2231f843077aa2407b95785a72340fe7e7%22%2C%22isList%22%3A0%2C%22page%22%3A%225%22%2C%22s%22%3A%22116%22%2C%22click%22%3A%220%22%2C%22log_id%22%3A%221720407348116.1660%22%2C%22show_items%22%3A%22%22%7D&loginType=3&uuid=143920055.1720405515970186801854.1720405516.1720405516.1720405516.1&area=5_148_0_0&h5st=20240708105614829%3B5t9mgnyi65z65t97%3Bf06cc%3Btk03wb0d01bf518n32iEIGovakaKKTQNowDzR6awWYSGFUuBxwxdqI7rI5zMMEIKtbV1MlLFKx401IvhD1hSA4wpk2GV%3B5d0169127b27592f8f4ed0c02afa60c6bd6f44f914f34b24623b173f52315e88%3B4.7%3B1720407374829%3BV2txYCUHGzP6Mdbj6ef1WEHd94GBw4BqN0XuAobRNPZRJ9h6goijEKMPd90k_-yy0h23f4vPUy24MZEFy-5Dkbl30-8KTuYJGRkWhbQHJagZ_W44FIX_aU0T-VVaCT05fg_EXF-1PNNZBqaXHYx0BYl8zZ19Td0vLFIjaL01RMbD07ZObz8tLp2Jn3DvJml1suhfTZ89y7tuV7ItBhR7lbKKWByWlB8XRWGSg9XsxxB0HAZMcIUdNrylVgeML9uj9b5qTJ29X23MbI-a8LskjhzRdae8sApBdXTmGY5EmOB90339yeyK6rAKBoUnwtPtsr41wX1NpBGgHFQ9KUYCwLL8fY_p5xIpUsrVxOLCu7nZggE7nDk8PeheJO0dl8zjLad9Prk3hGJ0DQIeqffFGvzEemLTD52YgeDqWQHLXbk3&x-api-eid-token=jdd03V5VWE343CHG6ZQ5SWNXMF26W56JIB22JTI3USWGZ62J6GCA43ERHGX2X3VC7A2CC52WLPNXLZGK6XH6BBTWT7G3ET4AAAAMQSAV7CNYAAAAADCQFPXFHNCSEDUX
# page6: https://api.m.jd.com/?appid=search-pc-java&functionId=pc_search_s_new&client=pc&clientVersion=1.0.0&t=1720407399435&body=%7B%22keyword%22%3A%22%E5%A4%A7%E5%9C%B0%E7%93%9C%22%2C%22qrst%22%3A%221%22%2C%22wq%22%3A%22%E5%A4%A7%E5%9C%B0%E7%93%9C%22%2C%22stock%22%3A%221%22%2C%22pvid%22%3A%2231f843077aa2407b95785a72340fe7e7%22%2C%22page%22%3A%226%22%2C%22s%22%3A%22146%22%2C%22scrolling%22%3A%22y%22%2C%22log_id%22%3A%221720407373638.7150%22%2C%22tpl%22%3A%221_M%22%2C%22isList%22%3A0%2C%22show_items%22%3A%22%22%7D&loginType=3&uuid=143920055.1720405515970186801854.1720405516.1720405516.1720405516.1&area=5_148_0_0&h5st=20240708105639436%3B5t9mgnyi65z65t97%3Bf06cc%3Btk03wb0d01bf518n32iEIGovakaKKTQNowDzR6awWYSGFUuBxwxdqI7rI5zMMEIKtbV1MlLFKx401IvhD1hSA4wpk2GV%3B084f60286a2ba65bc260d4c7d3f7bdaa938a94c061fb3bda340d4e211057e575%3B4.7%3B1720407399436%3BVW3lXgDJ-h5kW0QIYbrMUU6F9AS1a8dAr6OGtN5S3KhBefMf4_l2V5bCBt-RPuEDdkv6E_Sti0yZeINs6J-Zs5Aq7rBLAmLrbvW3Jo6x3IqP_W44FIX_aU0T-VVaCT05fg_EXF-1PNNZBqaXHYx0BYl8zZ19Td0vLFIjaL01RMbD07ZObz8tLp2Jn3DvJml1suhfTZ89y7tuV7ItBhR7lbKKWByWlB8XRWGSg9XsxxB0HAZMcIUdNrylVgeML9uj9b5qTJ29X23MbI-a8LskjhzRdae8sApBdXTmGY5EmOB90339yeyK6rAKBoUnwtPtsr41wX1NpBGgHFQ9KUYCwLL8fY_p5xIpUsrVxOLCu7nZggE7nDk8PeheJO0dl8zjLad9Prk3hGJ0DQIeqffFGvzEemLTD52YgeDqWQHLXbk3&x-api-eid-token=jdd03V5VWE343CHG6ZQ5SWNXMF26W56JIB22JTI3USWGZ62J6GCA43ERHGX2X3VC7A2CC52WLPNXLZGK6XH6BBTWT7G3ET4AAAAMQSAV7CNYAAAAADCQFPXFHNCSEDUX

# URL-decoded:
# https://api.m.jd.com/?appid=search-pc-java&functionId=pc_search_s_new&client=pc&clientVersion=1.0.0&t=1720407890290&body={"keyword":"大地瓜","qrst":"1","wq":"大地瓜","stock":"1","pvid":"31f843077aa2407b95785a72340fe7e7","page":"10","s":"266","scrolling":"y","log_id":"1720407864619.8300","tpl":"1_M","isList":0,"show_items":""}&loginType=3&uuid=143920055.1720405515970186801854.1720405516.1720405516.1720407835.2&area=5_148_0_0&h5st=20240708110450292;5t9mgnyi65z65t97;f06cc;tk03wb0d01bf518n32iEIGovakaKKTQNowDzR6awWYSGFUuBxwxdqI7rI5zMMEIKtbV1MlLFKx401IvhD1hSA4wpk2GV;704bed4f68c2ee25664115650234dcd9720fc69e33bbb0be7c2e4aa8b90d84db;4.7;1720407890292;VWQ64Dsf8LGqkpzRCzOQ0lP6zovZ-d9nI1SPesbrg2R6Xc9xh2X2O7UnaBZj7fFoer7TANkKb3zj0YV_8UO5MSpLxS9WfllrVxwRuBd53r2P_W44FIX_aU0T-VVaCT05fg_EXF-1PNNZBqaXHYx0BYl8zZ19Td0vLFIjaL01RMbD07ZObz8tLp2Jn3DvJml1suhfTZ89y7tuV7ItBhR7lbKKWByWlB8XRWGSg9XsxxB0HAZMcIUdNrylVgeML9uj9b5qTJ29X23MbI-a8LskjhzRdae8sApBdXTmGY5EmOB90339yeyK6rAKBoUnwtPtsr41wX1NpBGgHFQ9KUYCwLL8fY_p5xIpUsrVxOLCu7nZggE7nDk8PeheJO0dl8zjLad9Prk3hGJ0DQIeqffFGvzEemLTD52YgeDqWQHLXbk3&x-api-eid-token=jdd03V5VWE343CHG6ZQ5SWNXMF26W56JIB22JTI3USWGZ62J6GCA43ERHGX2X3VC7A2CC52WLPNXLZGK6XH6BBTWT7G3ET4AAAAMQSBGDWHIAAAAAC6MCL4YILLLO3QX
import asyncio
import logging
import time
from multiprocessing import Pool

import aiohttp
import aiomysql
import requests
from aiohttp import ContentTypeError
from lxml import etree

# Request parameters
# appid: search-pc-java
# functionId: pc_search_s_new
# client: pc
# clientVersion: 1.0.0
# t: 1720419837554
# body: {"keyword":"大地瓜","qrst":"1","wq":"大地瓜","stock":"1","pvid":"31f843077aa2407b95785a72340fe7e7","isList":0,"page":"3","s":"56","click":"0","log_id":"1720407889073.8949","show_items":""}
# loginType: 3
# uuid: 143920055.1720405515970186801854.1720405516.1720405516.1720407835.2
# area: 5_148_0_0
# h5st: 20240708142357556;5t9mgnyi65z65t97;f06cc;tk03wb0d01bf518n32iEIGovakaKKTQNowDzR6awWYSGFUuBxwxdqI7rI5zMMEIKtbV1MlLFKx401IvhD1hSA4wpk2GV;76b0218808aa807ffb62c60238f484ba51bf66b05c4c975a3e5cf9d7aa316ea4;4.7;1720419837556;VelsisygYUGRe31NktYIIIyGn7JfAPpt-GDNwiz9Lgfc7TrTDDGgsXxz8ylQfb4N-lRHExGRcKi0OHa4W3Nvj8WhEbsZ2IL-0lzWJKmZEC_1_W44FIX_aU0T-VVaCT05fg_EXF-1PNNZBqaXHYx0BYl8zZ19Td0vLFIjaL01RMbD07ZObz8tLp2Jn3DvJml1suhfTZ89y7tuV7ItBhR7lbKKWByWlB8XRWGSg9XsxxB0HAZMcIUdNrylVgeML9uj9b5qTJ29X23MbI-a8LskjhzRdae8sApBdXTmGY5EmOB90339yeyK6rAKBoUnwtPtsr41wX1NpBGgHFQ9KUYCwLL8fY_p5xIpUsrVxOLCu7nZggE7nDk8PeheJO0dl8zjLad9Prk3hGJ0DQIeqffFGvzEemLTD52YgeDqWQHLXbk3
# x-api-eid-token: jdd03V5VWE343CHG6ZQ5SWNXMF26W56JIB22JTI3USWGZ62J6GCA43ERHGX2X3VC7A2CC52WLPNXLZGK6XH6BBTWT7G3ET4AAAAMQSBGDWHIAAAAAC6MCL4YILLLO3QX

# Maximum number of concurrent requests
CONCURRENCY = 2

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s : %(message)s')


class Spider(object):
    def __init__(self):
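        # aiohttp session for the async requests (created in main())
        self.session = None
        # Semaphore that caps the number of concurrent requests
        self.semaphore = asyncio.Semaphore(CONCURRENCY)
        # Database connection pool (created in init_pool())
        self.pool = None

    # Initialize the database connection pool
    async def init_pool(self):
        self.pool = await aiomysql.create_pool(
            host="127.0.0.1",
            port=3306,
            user="root",
            password="123456",
            db="jingdong",
            autocommit=True  # Ensure autocommit is set to True for aiomysql
        )
        # aiomysql.create_pool does not need an explicit loop argument; it uses the current event loop.

    # Close the database connection pool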
    async def close_pool(self):
        if self.pool:
            self.pool.close()
            await self.pool.wait_closed()

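    # Fetch the response body of a URL
    async def scrape_api(self, url):
        # The semaphore caps how many requests run at once
        async with self.semaphore:
            try:
                logging.info(f'scraping {url}')
                async with self.session.get(url) as response:
                    # Throttle the crawl rate
                    await asyncio.sleep(1)
                    # Return the response body as text
                    return await response.text()
            except aiohttp.ClientError:
                # Log the failure with its traceback; None tells the caller there is nothing to parse
                logging.error(f'error occurred while scraping {url}', exc_info=True)
                return None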

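    # Build the URLs to scrape
    def get_urls(self):
        # Timestamp in milliseconds, matching the 13-digit t values in the captured requests
        t = int(time.time() * 1000)
        s = 26
        urls = []
        # Six API pages: the captured requests alternate between "click" and "scrolling" loads,
        # so each page seen in the browser presumably corresponds to two API pages
        for page in range(1, 7):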
            if page == 1:
                url = f'https://api.m.jd.com/?appid=search-pc-java&functionId=pc_search_s_new&client=pc&clientVersion=1.0.0&t={t}&body={{"keyword":"大地瓜","qrst":"1","suggest":"1.his.0.0","wq":"大地瓜","stock":"1","pvid":"6557c5ae68374b69b120ffd848006147","page":"{page}","s":"1","scrolling":"y","log_id":"{t}","tpl":"1_M","isList":0,"show_items":""}}&loginType=3&uuid=143920055.1720405515970186801854.1720405516.1720407835.1720426267.3&area=5_148_172_34120&h5st=20240708170416741;5t9mgnyi65z65t97;f06cc;tk03wb0d01bf518n32iEIGovakaKKTQNowDzR6awWYSGFUuBxwxdqI7rI5zMMEIKtbV1MlLFKx401IvhD1hSA4wpk2GV;428a1e9616a201717b6d47b2b3722bd452df43902900ff626a995e2227171c41;4.7;1720429456741;V_yUY6sbEL-CaIp_dxeIZ5pfbYaWIcva9sozBqHoAg6ZdawBS6lfLQF9JijQSlEpXWZqA93RjP5gIyjMqyv5EMtQXRKP83UlcFUcKqeFCVpE_W44FIX_aU0T-VVaCT05fg_EXF-1PNNZBqaXHYx0BYl8zZ19Td0vLFIjaL01RMbD07ZObz8tLp2Jn3DvJml1suhfTZ89y7tuV7ItBhR7lbKKWByWlB8XRWGSg9XsxxB0HAZMcIUdNrylVgeML9uj9b5qTJ29X23MbI-a8LskjhzRdae8sApBdXTmGY5EmOB90339yeyK6rAKBoUnwtPtsr41wX1NpBGgHFQ9KUYCwLL8fY_p5xIpUsrVxOLCu7nZggE7nDk8PeheJO0dl8zjLad9Prk3hGJ0DQIeqffFGvzEemLTD52YgeDqWQHLXbk3&x-api-eid-token=jdd03V5VWE343CHG6ZQ5SWNXMF26W56JIB22JTI3USWGZ62J6GCA43ERHGX2X3VC7A2CC52WLPNXLZGK6XH6BBTWT7G3ET4AAAAMQSGLBYGYAAAAADLIMW7XCPAQPEIX'
                urls.append(url)
            else:
                url = f'https://api.m.jd.com/?appid=search-pc-java&functionId=pc_search_s_new&client=pc&clientVersion=1.0.0&t={t}&body={{"keyword":"大地瓜","qrst":"1","suggest":"1.his.0.0","wq":"大地瓜","stock":"1","pvid":"6557c5ae68374b69b120ffd848006147","page":"{page}","s":"{s}","scrolling":"y","log_id":"{t}","tpl":"1_M","isList":0,"show_items":""}}&loginType=3&uuid=143920055.1720405515970186801854.1720405516.1720407835.1720426267.3&area=5_148_172_34120&h5st=20240708170416741;5t9mgnyi65z65t97;f06cc;tk03wb0d01bf518n32iEIGovakaKKTQNowDzR6awWYSGFUuBxwxdqI7rI5zMMEIKtbV1MlLFKx401IvhD1hSA4wpk2GV;428a1e9616a201717b6d47b2b3722bd452df43902900ff626a995e2227171c41;4.7;1720429456741;V_yUY6sbEL-CaIp_dxeIZ5pfbYaWIcva9sozBqHoAg6ZdawBS6lfLQF9JijQSlEpXWZqA93RjP5gIyjMqyv5EMtQXRKP83UlcFUcKqeFCVpE_W44FIX_aU0T-VVaCT05fg_EXF-1PNNZBqaXHYx0BYl8zZ19Td0vLFIjaL01RMbD07ZObz8tLp2Jn3DvJml1suhfTZ89y7tuV7ItBhR7lbKKWByWlB8XRWGSg9XsxxB0HAZMcIUdNrylVgeML9uj9b5qTJ29X23MbI-a8LskjhzRdae8sApBdXTmGY5EmOB90339yeyK6rAKBoUnwtPtsr41wX1NpBGgHFQ9KUYCwLL8fY_p5xIpUsrVxOLCu7nZggE7nDk8PeheJO0dl8zjLad9Prk3hGJ0DQIeqffFGvzEemLTD52YgeDqWQHLXbk3&x-api-eid-token=jdd03V5VWE343CHG6ZQ5SWNXMF26W56JIB22JTI3USWGZ62J6GCA43ERHGX2X3VC7A2CC52WLPNXLZGK6XH6BBTWT7G3ET4AAAAMQSGLBYGYAAAAADLIMW7XCPAQPEIX'
                urls.append(url)
                s += 30
        return urls

    # main: scrape the first 3 pages of product information
    async def main(self):
        # Request headers
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36',
            'Origin': 'https://search.jd.com',
            # 'Referer': 'https://search.jd.com/',
            # 'authority': 'api.m.jd.com',
            'Cookie': '__jdv=76161171|www.google.com|-|referral|-|1720405515971; __jdu=1720405515970186801854; areaId=5; ipLoc-djd=5-148-0-0; PCSYCityID=CN_130000_130400_0; shshshfpa=b91bb682-fc61-e92f-d4fa-6f2ab9016096-1720405517; shshshfpx=b91bb682-fc61-e92f-d4fa-6f2ab9016096-1720405517; _pst=jd_JEmItmZNMVjZ; unick=jd_19k0x9pj4djw61; pin=jd_JEmItmZNMVjZ; _tp=x6GXi%2BJW3x6n83VWqaBHsA%3D%3D; pinId=4zoL-jyJrsYADYzmUaUUbg; jsavif=1; jcap_dvzw_fp=D98Z_LJ6_qpQNHV4YAypinlsUHSWxNmfTaynt0Zdv_UsqMZp3i5Qpdsn0w4iLfi6Puo7XtYZf66yNMh-MoPnwg==; unpl=JF8EAJ5nNSttDEhXUh4BT0UUQ1tUW1sPTR9Ra2JRVQ1eTFVSSAIYQkR7XlVdWBRKER9ubhRUWlNLVg4aBisSEHtdVV9eDkIQAmthNWRVUCVXSBtsGHwQBhAZbl4IexcCX2cDV1xdSlABGwYTFBFLVFNXXAhCEwZfZjVUW2h7ZAQrAysTIAAzVRNdDkgWBm5jAVRZUE1VBRIFEhMQQllRblw4SA; 3AB9D23F7A4B3CSS=jdd03V5VWE343CHG6ZQ5SWNXMF26W56JIB22JTI3USWGZ62J6GCA43ERHGX2X3VC7A2CC52WLPNXLZGK6XH6BBTWT7G3ET4AAAAMQSFTVBHQAAAAAC4B6QWJAGOWDJ4X; _gia_d=1; mba_muid=1720405515970186801854; mba_sid=17204263891663834439004533139.1; wlfstk_smdl=cvb04qz3roc5w60cs5zyx6jif17m2plf; logintype=wx; npin=jd_JEmItmZNMVjZ; thor=60B98CF932ADA526FE25802592C19B87D6CE84D9F3BE42CF89AA8311226A742764C1877F1DF3A80F050EC47B4A640838398AACBCEF5DAB9D905EE6050D9280A3DAF6B0C4C1046A9CF34BC2A309838E2D3EF1A02A02BAC5B5D059837D42CE1D0EBB6A877CB19B0142334418A26D7781EB79EE6AF17DB9031460A688639BF795EBA9A05B0BA147282060EF0C8CD1A49E53640ABCC266E9EAA802150EEAB7097923; flash=2_2XfyufGunTsB2yv6828kN5vMWLQatIHqpk1syw5siK-0q3Jn3XMiJ0mUCKH3K-YnrxBL0OKhvHK_nTbXbhP5DfKN0yEorcr34im6PSWjEjl9pstRvn3LitJI2FykXYej-Qgtus5ZXnCFuSowFYxceVWG0GdRbiqx9dVtieXi41P*; __jda=143920055.1720405515970186801854.1720405516.1720407835.1720426267.3; __jdc=143920055; shshshfpb=BApXcuRtvkvVA7RiSDiYHm2YzBl0Cx9FMBmIBUjho9xJ1MqB1loC2; 3AB9D23F7A4B3C9B=V5VWE343CHG6ZQ5SWNXMF26W56JIB22JTI3USWGZ62J6GCA43ERHGX2X3VC7A2CC52WLPNXLZGK6XH6BBTWT7G3ET4; __jdb=143920055.19.1720405515970186801854|3.1720426267'
        }
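        # Create the session used for all async requests (carries headers, proxy, cookies, etc.)
        self.session = aiohttp.ClientSession(headers=headers)
        try:
            # Build the list of URLs
            urls = self.get_urls()
            # Create one task per URL
            tasks = [asyncio.ensure_future(self.get_prices(url)) for url in urls]
            # Wait for all results
            results = await asyncio.gather(*tasks)
            # Save to the database
            # 1. Initialize the connection pool
            await self.init_pool()
            # 2. Insert every price as a one-column row
            #    (tuple(price) would split the price string into single characters)
            for page_result in results:
                for price in page_result:
                    await self.save_to_mysql('prices', 'price', (price,))
        finally:
            # Close the connection pool and the HTTP session, even if a step above failed
            await self.close_pool()
            await self.session.close()

    # Save one row to MySQL
    async def save_to_mysql(self, table_name, table_column_str, values):
        async with self.pool.acquire() as conn:
            async with conn.cursor() as cursor:
                # One %s placeholder per value, so the driver escapes the data
                placeholders = ', '.join(['%s'] * len(values))
                sql = f'insert into {table_name}({table_column_str}) values({placeholders})'
                # Execute the SQL statement
                await cursor.execute(sql, values)
                await conn.commit()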

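    # Extract product prices
    async def get_prices(self, url):
        # Fetch the response body
        source = await self.scrape_api(url)
        # A failed or empty response has nothing to parse
        if not source:
            return []
        # Parse the prices with XPath
        prices = etree.HTML(source).xpath('//div[@class="p-price"]/i/text()')
        return prices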


if __name__ == '__main__':
    # Create the spider
    spider = Spider()
    # Get the event loop
    loop = asyncio.get_event_loop()
    # Run main() until it completes
    loop.run_until_complete(spider.main())
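The two f-string URLs in `get_urls` differ only in the `s` value, and they place raw JSON and Chinese characters directly in the query string. One possible alternative, sketched here under stated assumptions rather than taken from the original code, is to build the query with `json.dumps` and `urllib.parse.urlencode`. The helper name `build_search_url` and its `body_extra` / `query_extra` parameters are hypothetical. Session-specific fields such as `pvid`, `uuid`, `h5st` and `x-api-eid-token` are deliberately left out; they would have to be copied from a captured request and passed in by the caller.

```python
import json
import time
from urllib.parse import urlencode, urlsplit, parse_qs

API_URL = 'https://api.m.jd.com/'


def build_search_url(keyword, page, s, body_extra=None, query_extra=None):
    """Build one search URL; session-specific fields are supplied by the caller."""
    t_ms = int(time.time() * 1000)  # 13-digit millisecond timestamp
    body = {
        'keyword': keyword,
        'qrst': '1',
        'wq': keyword,
        'stock': '1',
        'page': str(page),
        's': str(s),
        'scrolling': 'y',
        'log_id': str(t_ms),
        'tpl': '1_M',
        'isList': 0,
        'show_items': '',
    }
    body.update(body_extra or {})
    params = {
        'appid': 'search-pc-java',
        'functionId': 'pc_search_s_new',
        'client': 'pc',
        'clientVersion': '1.0.0',
        't': t_ms,
        # Compact JSON, as in the captured requests; urlencode percent-encodes it
        'body': json.dumps(body, ensure_ascii=False, separators=(',', ':')),
    }
    params.update(query_extra or {})
    return f'{API_URL}?{urlencode(params)}'


# Usage: build a URL, then decode it again to check the round trip
url = build_search_url('大地瓜', page=2, s=27, query_extra={'loginType': '3'})
decoded = parse_qs(urlsplit(url).query)
print(json.loads(decoded['body'][0])['page'])
```

Keeping the fixed parameters in one dictionary also makes it easier to follow the principle from the approach section: everything the browser sent stays in the request, and only the values that genuinely change are computed.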

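Choosing an Optimization Strategy

  1. The relationship between CPU cores and the number of processes
    • If your CPU has multiple cores and you have a large number of CPU-bound tasks, multiprocessing lets you use every core and achieve true parallel computation. This is very effective for heavy compute workloads.
  2. Characteristics of asynchronous network requests
    • Asynchronous network requests (for example with asyncio and aiohttp) are designed mainly for I/O-bound tasks. Many requests can be in flight concurrently, yet each one needs little CPU time, because most of its lifetime is spent waiting for the response.
    • The concurrency you get from asynchronous requests is limited mainly by network bandwidth and the remote server's response time, not by the number of CPU cores.
  3. Putting it together
    • If you have many CPU-bound tasks and each one is computationally heavy, multiprocessing can significantly raise overall throughput, because each process can run in parallel on a different CPU core.
    • If the work is mostly a large number of network requests, and those requests are I/O-bound, asynchronous I/O with an event loop (such as asyncio) is the better fit: it manages many concurrent I/O operations efficiently inside a single thread, without the added complexity of multiprocessing.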
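The point about I/O-bound work can be seen without touching the network. In this minimal sketch, `fake_request` is a hypothetical stand-in that only sleeps, the way a coroutine would while waiting for a response, and a semaphore caps the number of requests in flight, as the crawler above does with CONCURRENCY. Six requests of 0.2 s each, at most three at a time, should finish in roughly two rounds rather than six.

```python
import asyncio
import time


async def fake_request(page: int, delay: float = 0.2) -> str:
    """Stand-in for a network call: the coroutine only waits, using almost no CPU."""
    await asyncio.sleep(delay)
    return f'page {page} done'


async def fetch_all(pages, limit: int = 3):
    """Run all fake requests concurrently, with at most `limit` in flight at once."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(page):
        async with semaphore:
            return await fake_request(page)

    return await asyncio.gather(*(bounded(p) for p in pages))


def timed_run(pages, limit):
    """Return the results and the wall-clock time the whole batch took."""
    start = time.perf_counter()
    results = asyncio.run(fetch_all(pages, limit))
    return results, time.perf_counter() - start


results, elapsed = timed_run(range(1, 7), limit=3)
print(results)
print(f'elapsed: {elapsed:.2f}s (sequential would take about 1.2s)')
```

All of this runs in one thread on one core, which is why adding CPU cores does little for a crawler whose time is spent waiting.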

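Conclusion

  • For large numbers of network requests, asynchronous I/O (such as asyncio with aiohttp) usually performs well, because it manages concurrent I/O operations efficiently inside a single thread.
  • If there is also a lot of CPU-bound computation and several CPU cores are available, consider multiprocessing to use those cores and speed up the overall job.
  • Weigh the nature of the tasks against the resources of the system; choosing the right concurrency model for each is what yields the largest performance gain.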
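The two models can be combined in one program. The sketch below is an illustration under assumptions, not something the crawler above does: `count_prices` is a hypothetical stand-in for CPU-heavy parsing, and `parse_pages` hands each page to an executor through `loop.run_in_executor`. Passing a `ProcessPoolExecutor` spreads the work over several cores; passing `None` falls back to the event loop's default thread pool.

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor


def count_prices(html: str) -> int:
    """Stand-in for CPU-heavy parsing: count the price blocks in a page of HTML."""
    return html.count('class="p-price"')


async def parse_pages(pages, executor=None):
    """Run count_prices for every page in the given executor (None = default thread pool)."""
    loop = asyncio.get_running_loop()
    jobs = [loop.run_in_executor(executor, count_prices, page) for page in pages]
    return await asyncio.gather(*jobs)


if __name__ == '__main__':
    # Made-up sample pages containing 1, 2 and 3 price blocks
    sample_pages = ['<div class="p-price"></div>' * n for n in (1, 2, 3)]
    with ProcessPoolExecutor(max_workers=2) as pool:
        print(asyncio.run(parse_pages(sample_pages, executor=pool)))
```

The event loop stays free to issue requests while the worker processes parse, so the I/O-bound and CPU-bound parts each get the model that suits them.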

More crawler content will follow (published in the order of a training course's crawler curriculum; follow along for future posts).
