How do you set a crawler's request rate?

news2026/2/18 18:46:43

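Limiting a crawler's request rate is an important way to keep its behavior legitimate and to avoid putting excessive load on the target site's servers. A reasonable rate lowers the risk of having your IP banned while still keeping the crawler efficient. The following are some methods and strategies for setting a crawler's request rate: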

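1. Setting a request interval

(1) Fixed interval

Wait a fixed amount of time between consecutive requests so the target site is not put under heavy load. For example, pick one fixed value between 1 and 3 seconds and use it for every request.

Example code: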

import requests
import time

def get_html(url):
    # Send a browser-like User-Agent header with the request
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
    }
    # A timeout keeps a stalled connection from hanging the crawler
    response = requests.get(url, headers=headers, timeout=10)
    return response.text

def main():
    urls = ["https://example.com/page1", "https://example.com/page2", "https://example.com/page3"]
    for url in urls:
        html = get_html(url)
        # Process the page content
        print(html)
        time.sleep(2)  # Wait a fixed 2 seconds between requests

if __name__ == "__main__":
    main()
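
A fixed `time.sleep` adds its pause on top of however long the request itself took, so the real interval between requests is longer than the configured one. The sketch below is one way to hold a minimum interval between request starts instead. `MinIntervalThrottle` is a hypothetical helper, not part of `requests`, and the injectable `clock` and `sleep` parameters are assumptions added only to make it testable.

```python
import time

class MinIntervalThrottle:
    """Keep at least `min_interval` seconds between consecutive request starts."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable for testing (assumption)
        self._sleep = sleep      # injectable for testing (assumption)
        self._last_start = None  # when the previous request was allowed to start

    def wait(self):
        """Block until the next request may start; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last_start is not None:
            elapsed = now - self._last_start
            # Only sleep for the part of the interval that has not passed yet.
            delay = max(0.0, self.min_interval - elapsed)
        if delay > 0:
            self._sleep(delay)
        self._last_start = now + delay
        return delay
```

In the loop above, this would mean creating `throttle = MinIntervalThrottle(2.0)` once and calling `throttle.wait()` before each `get_html(url)` in place of `time.sleep(2)`.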

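(2) Random interval

A perfectly regular request pattern is easy for simple anti-crawling mechanisms to recognize, so you can randomize the interval instead. For example, choose each interval at random between 1 and 3 seconds.

Example code: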

import requests
import time
import random

def get_html(url):
    # Send a browser-like User-Agent header with the request
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
    }
    # A timeout keeps a stalled connection from hanging the crawler
    response = requests.get(url, headers=headers, timeout=10)
    return response.text

def main():
    urls = ["https://example.com/page1", "https://example.com/page2", "https://example.com/page3"]
    for url in urls:
        html = get_html(url)
        # Process the page content
        print(html)
        time.sleep(random.uniform(1, 3))  # Wait a random 1 to 3 seconds between requests

if __name__ == "__main__":
    main()
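
It can be easier to reason about the average rate when the random interval is written as a base delay plus or minus some jitter. A minimal sketch under that assumption follows. `jittered_delay` and its `rng` parameter are hypothetical names introduced here for illustration.

```python
import random

def jittered_delay(base, jitter, rng=random):
    """Return a delay drawn uniformly from base - jitter to base + jitter, never below 0."""
    if jitter < 0:
        raise ValueError("jitter must be non-negative")
    # rng can be the random module or a seeded random.Random instance.
    return max(0.0, base + rng.uniform(-jitter, jitter))
```

With `base=2` and `jitter=1` this covers the same 1 to 3 second range as `random.uniform(1, 3)`, and the average interval stays at `base`.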

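2. Controlling concurrent requests

(1) Limiting the number of concurrent requests

In a multithreaded or multiprocess crawler, capping the number of concurrent requests keeps the crawler from overloading the target site. For example, use the concurrent.futures module to cap the maximum number of workers.

Example code: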

import requests
import concurrent.futures
import time

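def get_html(url):
    # Send a browser-like User-Agent header with the request
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
    }
    response = requests.get(url, headers=headers, timeout=10)
    # Pause inside the worker so each thread waits 1 second before taking its next URL.
    # Sleeping in the result loop would not slow the requests, because every task
    # has already been submitted by the time results are read.
    time.sleep(1)
    return response.text

def main():
    urls = ["https://example.com/page1", "https://example.com/page2", "https://example.com/page3"]
    max_workers = 3  # At most 3 requests in flight at once
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(get_html, url) for url in urls]
        for future in concurrent.futures.as_completed(futures):
            html = future.result()
            # Process the page content
            print(html)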

if __name__ == "__main__":
    main()
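
Capping the number of workers bounds how many requests are in flight, but not how many are started per second across all threads. One possible way to bound the combined rate is a limiter shared by every worker, sketched below. `SharedRateLimiter` and `run_throttled` are hypothetical names, and `task` stands in for a function such as `get_html`.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

class SharedRateLimiter:
    """Hand out start slots spaced at least `min_interval` seconds apart."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._lock = threading.Lock()
        self._next_slot = time.monotonic()

    def acquire(self):
        """Reserve the next slot, sleep until it arrives, and return its time."""
        with self._lock:
            now = time.monotonic()
            slot = max(now, self._next_slot)
            self._next_slot = slot + self.min_interval
        # Sleep outside the lock so other threads can reserve their own slots.
        wait = slot - now
        if wait > 0:
            time.sleep(wait)
        return slot

def run_throttled(task, items, limiter, max_workers=3):
    """Run task(item) for each item in a thread pool, sharing one limiter."""
    def worker(item):
        limiter.acquire()
        return task(item)

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map returns results in the same order as items.
        return list(executor.map(worker, items))
```

For instance, `run_throttled(get_html, urls, SharedRateLimiter(1.0))` would start at most about one request per second no matter how many workers are running.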

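3. Adjusting the request rate dynamically

(1) Adjusting based on the response status

Adjust the request rate according to the status codes the target site returns. If the status code is 200, keep the current rate; if it is 429 (Too Many Requests), lower the rate.

Example code: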

import requests
import time

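def get_html(url):
    # Send a browser-like User-Agent header with the request
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
    }
    response = requests.get(url, headers=headers, timeout=10)
    return response  # Return the full response so the caller can check the status code

def main():
    urls = ["https://example.com/page1", "https://example.com/page2", "https://example.com/page3"]
    delay = 2  # Current interval between requests, in seconds
    for url in urls:
        response = get_html(url)
        if response.status_code == 200:
            # Process the page content
            print(response.text)
        elif response.status_code == 429:
            print("Too Many Requests, reducing request frequency")
            # Double the interval for all later requests, capped at 60 seconds
            # (the URL that received the 429 is not retried here)
            delay = min(delay * 2, 60)
        time.sleep(delay)  # Wait before the next request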

if __name__ == "__main__":
    main()
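
A 429 response may carry a `Retry-After` header, which is either a number of seconds or an HTTP date. The sketch below shows one way to fold that hint into the next interval and to recover gradually once responses succeed again. `next_delay`, `base_delay`, and `max_delay` are hypothetical names and values, and only the number-of-seconds form of the header is handled.

```python
def next_delay(current_delay, status_code, retry_after=None,
               base_delay=2.0, max_delay=60.0):
    """Return the interval, in seconds, to wait before the next request."""
    if status_code == 429:
        # Prefer the server's hint when it is a plain number of seconds.
        if retry_after is not None and retry_after.strip().isdigit():
            return min(max_delay, max(float(retry_after), current_delay))
        # Otherwise back off exponentially, up to max_delay.
        return min(max_delay, current_delay * 2)
    if status_code == 200:
        # Recover gradually: halve the interval, but never go below base_delay.
        return max(base_delay, current_delay / 2)
    # For any other status, keep the current interval unchanged.
    return current_delay
```

In a loop like the one above, this could be called as `delay = next_delay(delay, response.status_code, response.headers.get("Retry-After"))` before each `time.sleep(delay)`.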

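4. Using proxy IPs

Proxy IPs spread requests across several source addresses, which reduces the chance of a ban caused by frequent access from a single IP. You can obtain dynamic proxy IPs from a proxy service provider and use them in the crawler.

Example code: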

import random  # Needed for random.choice below
import requests
import time

def get_html(url, proxy=None):
    # Send a browser-like User-Agent header with the request
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
    }
    # Route both HTTP and HTTPS traffic through the proxy; use no proxy if none is given
    proxies = {'http': proxy, 'https': proxy} if proxy else None
    response = requests.get(url, headers=headers, proxies=proxies, timeout=10)
    return response.text

def main():
    urls = ["https://example.com/page1", "https://example.com/page2", "https://example.com/page3"]
    proxy_list = ["http://proxy1.example.com:8080", "http://proxy2.example.com:8080"]
    for url in urls:
        proxy = random.choice(proxy_list)  # Pick a proxy at random for each request
        html = get_html(url, proxy)
        # Process the page content
        print(html)
        time.sleep(2)  # Wait 2 seconds between requests

if __name__ == "__main__":
    main()
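
Random selection can send several requests in a row through the same proxy. If an even spread matters, a round-robin assignment is one alternative, sketched here with `itertools.cycle`. `assign_proxies` is a hypothetical helper introduced for illustration.

```python
import itertools

def assign_proxies(urls, proxy_list):
    """Pair each URL with a proxy, cycling through proxy_list in order."""
    if not proxy_list:
        raise ValueError("proxy_list must not be empty")
    # zip stops when urls runs out, so the endless cycle is safe here.
    return list(zip(urls, itertools.cycle(proxy_list)))
```

Each `(url, proxy)` pair can then be passed to `get_html(url, proxy)`. Rotating proxies changes where requests appear to come from, not how many the target server receives, so the pause between requests is still needed.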